Abstract

Current speech recognition systems tend to be developed only for commercially
viable languages. The resources needed for a typical speech recognition system
include hundreds of hours of transcribed speech for acoustic models and 10 to 100
million words of text for language models; both of these requirements can be costly
in time and money. The goal of this research is to facilitate rapid development of
speech systems for new languages by using multilingual phoneme models to alleviate
requirements for large amounts of transcribed speech. The GlobalPhone database,
which contains transcribed speech from 15 languages, is used as source data to derive
multilingual phoneme models. Various bootstrapping processes are used to develop
an Arabic speech recognition system starting from monolingual English models,
International Phonetic Alphabet (IPA) based multilingual models, and data-driven
multilingual models. The Kullback-Leibler distortion measure is used to derive
data-driven phoneme clusters. It was found that multilingual bootstrapping methods
initially outperform monolingual English bootstrapping methods on the Arabic
evaluation data, and that after three iterations of bootstrapping all systems show
similar performance levels. Applications of this research are in speech recognition,
word spotting, information retrieval, and speech-to-speech translation.





Acknowledgements 

I would first like to thank Drs. Slyh and Anderson for the many hours spent discussing
ideas and implementation issues to allow this research to progress to this point and for
helping brainstorm research paths which led to this work in the first place. Thank
you also to Drs. Gustafson and Colombi for timely and valuable feedback during the
experimentation time and for review of this thesis.

I would like to thank Dr. Bryan Pellom, formerly of the University of Colorado
at Boulder, for his time developing the SONIC speech recognition system and for his 
time spent discussing the inner workings of SONIC with me and how best to use it 
to achieve the goals of this research. 

I would like to thank Dr. Grant McMillan, my supervisor, for allowing flexible 
work hours to allow these graduate studies to take place. 

I would also like to thank Mr. Brian Ore for work and discussions on the
formatting of the GlobalPhone database. Thanks go to Mr. Dave Hoeferlin for keeping
the computer systems up and running during these experiments.

Thanks go to Dr. Tanja Schultz for collecting the GlobalPhone database and 
allowing for its use by the speech community. 

Special thanks to my wife who has supported me and been at my side during 
this entire process. 


Eric G. Hansen 





Table of Contents

Abstract
Acknowledgements
List of Figures
List of Tables
List of Abbreviations

I. Introduction

II. Background
    2.1 Speech Recognition
        2.1.1 Continuous Speech
        2.1.2 Large Vocabulary ASR
        2.1.3 Speaker Independence
    2.2 Units of Speech
        2.2.1 Phonemes in Context
    2.3 Multilingual Research
        2.3.1 International Phonetic Alphabet
        2.3.2 Multilingual vs. Monolingual Speech Recognition
        2.3.3 Data-Driven Approaches
    2.4 Automatic Speech Recognition
        2.4.1 General Signal Processing
        2.4.2 Mel Frequency Cepstral Coefficients
        2.4.3 Pronunciation Lexicon or Dictionary
        2.4.4 Language Model
    2.5 Hidden Markov Models
        2.5.1 HMM Training
        2.5.2 Baum-Welch Re-estimation
        2.5.3 Viterbi Algorithm and Decoding
    2.6 The SONIC Speech Recognition System
        2.6.1 Speech Detection and Feature Representation
        2.6.2 Acoustic Model
        2.6.3 Monophone Acoustic Models
        2.6.4 Triphone Acoustic Models
        2.6.5 Model Adaptation
        2.6.6 Porting SONIC to Other Languages

III. Experimental Results
    3.1 Language Inventory
        3.1.1 Details of Languages used in this Research
        3.1.2 Description of the Arabic partition of GlobalPhone
    3.2 Multilingual Phoneme Set
    3.3 Performance Metrics
        3.3.1 Phoneme Error Rate - Equally Likely Phonemes
        3.3.2 Word Error Rate
        3.3.3 Phoneme Error Rate - Word Language Model
        3.3.4 Phoneme Confusion Matrix
    3.4 Bootstrapping from English Results

IV. Porting SONIC to Arabic
    4.1 Bootstrapping from English
    4.2 Bootstrapping from IPA-Based Multilingual Phonemes
        4.2.1 Building ML-IPA Acoustic Models
    4.3 Bootstrapping from Data-Driven Multilingual Phonemes
        4.3.1 Kullback-Leibler Distortion
        4.3.2 Data-Driven Phoneme Clusters
        4.3.3 Building the ML-DD Acoustic Models
    4.4 Bootstrapping Results
    4.5 Adapting Multilingual AMs to Arabic
    4.6 Supplementing with IPA and Data-Driven Multilingual Data

V. Conclusions
    5.1 Review
    5.2 Future Work

Appendix A. Phoneme Confusion Matrices
Appendix B. Additional Information

Bibliography

























List of Figures

Figure

2.1. IPA Chart
2.2. Block diagram of an automatic speech recognition system
2.3. A typical left-to-right HMM
2.4. Process of calculating Mel Frequency Cepstral Coefficients
2.5. Example HMM sequence for the word “one”
2.6. Example decision tree for the base phoneme /AA/
2.7. Block diagram of bootstrapping an ASR system from an initial language to a target language
4.1. Monophone AM results for Bootstrapping Experiments
4.2. Triphone AM results for Bootstrapping Experiments
4.3. Monophone AM Adaptation PER-ELP Results
4.4. Triphone AM Adaptation PER-ELP Results
4.5. Monophone AM Supplementation PER-ELP Results
4.6. Triphone AM Supplementation PER-ELP Results













List of Tables

Table

3.1. Multilingual phonemes as listed in the GlobalPhone dictionaries
3.2. Continuation of the multilingual phonemes as listed in the GlobalPhone dictionaries
3.3. Error rates on four languages in the GlobalPhone database using the bootstrapping from English method and monophone AMs
3.4. Error rates on four languages in the GlobalPhone database using the bootstrapping from English method and triphone AMs
3.5. Error rates on three languages in the GlobalPhone database as stated in Schultz
4.1. Data-driven phoneme groupings based on the KL-distance metric for various thresholds
A.1. Phoneme Confusion Matrix for monophone AM bootstrapped from English AM with zero iterations
A.2. Phoneme Confusion Matrix comparing differences in monophone AM bootstrapped from English AM with zero iterations to monophone AM bootstrapped from ML-IPA
A.3. Phoneme Confusion Matrix comparing differences in monophone AM bootstrapped from English AM with zero iterations to monophone AM bootstrapped from ML-DD10
A.4. Phoneme Confusion Matrix for monophone AM bootstrapped from English AM with three iterations
A.5. Phoneme Confusion Matrix comparing differences in monophone AM bootstrapped from English AM with three iterations to monophone AM bootstrapped from ML-IPA
A.6. Phoneme Confusion Matrix comparing differences in monophone AM bootstrapped from English AM with three iterations to monophone AM bootstrapped from ML-DD10
A.7. Phoneme Confusion Matrix for triphone AM bootstrapped from English AM with zero iterations
A.8. Phoneme Confusion Matrix comparing differences in triphone AM bootstrapped from English AM with zero iterations to triphone AM bootstrapped from ML-IPA
A.9. Phoneme Confusion Matrix comparing differences in triphone AM bootstrapped from English AM with zero iterations to triphone AM bootstrapped from ML-DD10
A.10. Phoneme Confusion Matrix for triphone AM bootstrapped from English AM with three iterations
A.11. Phoneme Confusion Matrix comparing differences in triphone AM bootstrapped from English AM with three iterations to triphone AM bootstrapped from ML-IPA
A.12. Phoneme Confusion Matrix comparing differences in triphone AM bootstrapped from English AM with three iterations to triphone AM bootstrapped from ML-DD10
A.13. Phoneme Confusion Matrix comparing differences in monophone AM bootstrapped from English AM with three iterations to this AM supplemented with monophone data from IPA labels
A.14. Phoneme Confusion Matrix comparing differences in monophone AM bootstrapped from English AM with three iterations to this AM supplemented with monophone data from ML-DD5b labels
A.15. Phoneme Confusion Matrix comparing differences in triphone AM bootstrapped from English AM with three iterations to this AM supplemented with triphone data from IPA labels
A.16. Phoneme Confusion Matrix comparing differences in triphone AM bootstrapped from English AM with three iterations to this AM supplemented with triphone data from ML-DD5b labels
B.1. American English phoneme set used by SONIC
B.2. Number of words in each language’s dictionary
B.3. Count and average duration of each multilingual phoneme in the GlobalPhone test subset
B.4. Continuation of the count and average duration of each multilingual phoneme in the GlobalPhone test subset













List of Abbreviations

Abbreviation

ASR       Automatic Speech Recognition
HMMs      Hidden Markov Models
AM        Acoustic Model
LM        Language Model
IPA       International Phonetic Alphabet
DD        Data-Driven
MFCCs     Mel Frequency Cepstral Coefficients
DCT       Discrete Cosine Transform
CMS       Cepstral Mean Subtraction
GMMs      Gaussian Mixture Models
SAD       Speech Activity Detection
EM        Expectation Maximization
SMAPLR    Structural Maximum a Posteriori Linear Regression
TL        Target Language
MLLR      Maximum Likelihood Linear Regression
PER       Phoneme Error Rate
WER       Word Error Rate
OOV       Out of Vocabulary
ELP       Equally-Likely Phonemes
WLM       Word Language Model
ML        Multilingual
KL        Kullback-Leibler
KLDM      Kullback-Leibler Distance Measure

























Multilingual Phoneme Models for Rapid 
Speech Processing System Development 

I. Introduction 

The concept of a machine that is able to understand human speech has been 
around since before computers were developed. The thought of using a machine 
to translate from one language to another has been around nearly as long. The 
computing power now available in portable devices may one day lead to the universal 
language translators that are currently science fiction. 

A key component in automatic speech translation systems first recognizes the 
foreign speech, and this recognition is the focus of this research. While much money
and research have gone into building speech recognition systems in English and other
commercially viable languages, there is a need to be able to rapidly build speech 
recognition systems in other languages as economic and political landscapes change. 
A recent example of this requirement is the tsunami that occurred in Indonesia in 
2004. The U.S. sent troops for aid relief, and the call was put out for tools to help the 
troops communicate with the local population. At the time, there were no Indonesian 
language speech recognition systems available, nor would there be any time soon due 
to the long development time required. The typical process for building a speech 
recognition system for a given language is to first collect hundreds of hours of speech 
data, transcribe this speech, and then to collect 10 to 100 million words of text for a 
language model. There is also a requirement for a pronunciation dictionary, and time 
is required for algorithm development and for building the speech recognition system. 
To allow speech recognition systems to be developed rapidly, especially for speech
translation purposes, new processes must be investigated that can access the current
pool of large speech resources in a few languages, and then share this information by
porting to many languages, thereby minimizing the data collection stage of building
a speech recognition system.

One recent idea is to use multiple languages for which there are data to support
the development of new systems in other languages. Two approaches include
multilingual speech recognition systems that recognize multiple languages with one set
of models and monolingual speech recognition systems that draw from a multilingual
training space. The second approach is the focus of this research and it considers the
various ways that multilingual data can be used to build phoneme models in other
languages.

This thesis is organized as follows. First, a background in speech recognition,
units of speech, and past multilingual research is given. An overview follows of
automatic speech recognition processes, including a description of Hidden Markov Models
(used to model the phonemes) and specific details on the SONIC speech recognition
system, which is the software used for all modeling experiments. Chapter three
discusses the experimental results using the GlobalPhone database, outlines the
performance metrics and discusses the baseline experiments. Chapter four discusses various
approaches to integrating the multilingual training data to build an Arabic speech
recognition system and discusses the results of each experiment. Chapter five
contains a summary of the work presented and highlights ideas for future investigation.
The Appendix contains the Phoneme Confusion Matrices discussed in Chapter four.
These matrices help provide a more thorough analysis of the results than average
phoneme error rates provide.

II. Background

This chapter addresses the main areas of statistical automatic speech recognition 
(ASR). First, an overview of terms used to describe speech recognition systems and 
basic concepts is given. Next, “units of speech,” specifically phonemes, are described. 
Then, previous multilingual research as related to this work is discussed. Finally, the 
different components of an ASR system are described in detail, including the inner 
workings of the SONIC speech recognition system, which is used for the experiments 
in this research. Special attention is given to Hidden Markov Models (HMMs) as they 
are the most common statistical models used for ASR and are used within the SONIC 
speech recognition system. 

2.1 Speech Recognition 

Recognizing speech is a difficult problem under even the best circumstances.
Every person has a different style of speaking (accent and dialect), which can also
vary with health, emotional state, and the meaning that the speech is supposed to
convey. In addition, when background noise or additional speakers are included,
the challenge of speech understanding becomes greater. Finally, the phrase
“context is everything” applies strongly to speech recognition. The two phrases below
contain similar phonetic information, but with different voicings convey two different
meanings.


Recognize Speech 
Wreck a nice beach 

Humans can easily distinguish between the two phrases, but we rely on years 
of learning and have adapted into excellent pattern recognizers. A computer must be 
trained to learn such differences. 

2.1.1 Continuous Speech. Continuous speech can be either read or spontaneous,
but it is much more difficult for automatic speech recognition systems than
isolated word recognition because of coarticulation effects as our articulators (tongue,
teeth, lips, mouth, etc.) move from word to word. An audio signal is sampled at some
set rate, and a sequence of feature vectors is extracted for a defined window of time.
This feature vector sequence is the observation space,

\[ O = \{o_1, o_2, \ldots, o_N\}, \]

where $N$ is the total number of observation frames. From this sequence, the goal is to
determine the most likely word sequence that could have produced this observation
sequence. Let $W$ represent a sequence of hypothesized words uttered by the speaker,

\[ W = \{w_1, w_2, \ldots\}. \]

Let $\hat{W}$ be the most likely sequence of words given the sequence of feature vectors,
then

\[ \hat{W} = \arg\max_{W} P(W \mid O). \]

From Bayes’ Rule,

\[ P(W \mid O) = \frac{P(O \mid W)\,P(W)}{P(O)}. \]

Given $O$, $P(O)$ is constant for all possible word sequences, so

\[ \hat{W} = \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)} = \arg\max_{W} P(O \mid W)\,P(W). \]

The term $P(O \mid W)$ is provided by the Acoustic Model (AM), which is estimated from
transcribed speech training data. $P(W)$ is derived from the Language Model (LM),
which characterizes the probability of observing a sequence of words based on prior
knowledge and is estimated from textual training data. This model is the basis for
modern statistically-based speech recognizers, as opposed to rule-based systems. The
focus of this research is to investigate different strategies to derive $P(O \mid W)$.
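
As a rough illustration of the decision rule above, the following Python sketch picks the word sequence with the highest combined acoustic and language model score. Working in the log domain turns the product of the two probabilities into a sum. The function name `decode` and all score values are hypothetical and chosen only for illustration; a real recognizer searches a very large hypothesis space rather than a short explicit list.

```python
import math

def decode(acoustic_logp, lm_logp):
    """Return the hypothesis W that maximizes log P(O|W) + log P(W)."""
    best_words, best_score = None, -math.inf
    for words, am_score in acoustic_logp.items():
        score = am_score + lm_logp[words]  # log of P(O|W) * P(W)
        if score > best_score:
            best_words, best_score = words, score
    return best_words, best_score

# Hypothetical log scores for two acoustically similar hypotheses.
acoustic_logp = {
    ("recognize", "speech"): -120.0,
    ("wreck", "a", "nice", "beach"): -118.5,
}
lm_logp = {
    ("recognize", "speech"): -6.0,
    ("wreck", "a", "nice", "beach"): -14.0,
}

best_words, best_score = decode(acoustic_logp, lm_logp)
print(best_words, best_score)
```

In this made-up case the acoustic score slightly favors the second phrase, but the language model prior tips the decision toward the first.
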
2.1.2 Large Vocabulary ASR. Large vocabulary is a term used to describe
modern continuous speech ASR systems and is in contrast to “command and control”
recognition applications that have a very limited vocabulary. Ten years ago, 1000
words was a large vocabulary. Today, a 65,000 or even 100,000 word vocabulary is
common. Careful search strategies are required to traverse this increased space, and
trade-offs are ever present between accuracy, vocabulary size, and algorithm speed.

2.1.3 Speaker Independence. A speaker-independent speech recognition system
is capable of recognizing speech from any speaker, including speakers outside the
training space. To achieve speaker independence, a wide variety of training data is
used to provide a good representation of the “speaker space,” including accent and
dialect. In building a speaker-independent model, accuracy is lost for any one speaker
because of model generalization. Speaker-dependent systems are tuned to a specific
group of speakers (gender-specific or speaker-specific, for example) and can improve
speech recognition performance when the models are applied to the proper group.
Speaker adaptation techniques can be employed that, given a small amount of data,
readily adapt the speaker-independent models to speaker-dependent models for
improved performance. A trade-off exists between flexibility and performance.

2.2 Units of Speech 

Words are the most natural units of speech on which to build statistical models
for speech recognizers, and for small vocabulary tasks whole-word models are
sometimes used. However, due to their lack of generality, word models are not ideal for
large vocabulary tasks, as they require large amounts of training data. Syllables are a
smaller level unit of speech, but syllables also suffer from a lack of generality. A still
smaller unit of speech is the phoneme. It is standard procedure to build large
vocabulary speech recognizers using statistical models of phonemes, especially of phonemes
in various left and right phonetic contexts.

A phoneme is the smallest meaningful contrastive unit in the phonology of a
language [27]. The sounds associated with each phoneme usually have some
articulatory gesture(s) in common. Each language has its own set of phonemes. For example,
English has roughly 50 phonemes, while Turkish has approximately 30 phonemes.
Phonemes can be grouped together into two primary categories, consonants and
vowels. Consonants can be further broken down into nasals, plosives, fricatives,
approximants, trills, and flaps. Vowels are broken down by where the tongue is positioned
within the mouth cavity and the shape of the lips that produces a particular vowel.

2.2.1 Phonemes in Context. The definition of each phoneme can be
described by specific positions of the articulators (tongue, lips, teeth, etc.). However,
it takes time to move the articulators from one position to the next, and often the
proper position for a given phoneme is never reached as the articulators finish one
phoneme and are already moving on to produce the next phoneme. In saying the
word “happy”, the /h/ is influenced by the following /a/ while it is being spoken
(known as anticipatory coarticulation), and the /a/ in turn is modified as the
articulators position themselves for /p/. The /a/ in “happy” will have different
characteristics than the /a/ in “cat” even though they are the same phoneme. The degree
of coarticulation between two sounds depends on, but is not restricted to, the interval
between adjacent sounds. The phoneme /k/ has a substantial amount of lip rounding
when the next sound is round as in “coo”, but if there is a separation between the /k/
and the rounding, as in “clue”, the /k/ is less rounded.

Two types of models are investigated, monophone and triphone models.
Monophone models treat each phoneme as an independent sound, and all instances of a
given phoneme are treated equally (i.e., the phonemes are context-independent).
Triphone models are groups of models for the same base phoneme that differ based on
the phonemes that occur before and after the base phoneme, thus taking into account
the immediately preceding and following phonetic contexts (i.e., the phonemes are
context-dependent).
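
The difference between the two model types can be sketched in a few lines of Python. The function below expands a monophone sequence into context-dependent triphone labels. The `left-base+right` label format and the `SIL` boundary symbol are assumptions borrowed from common practice in speech toolkits; they are not taken from the source and may differ from the labels SONIC uses internally.

```python
def to_triphones(phonemes, boundary="SIL"):
    """Expand a monophone sequence into left-base+right triphone labels."""
    # Pad with an assumed boundary symbol so that the first and last
    # phonemes also have a left and right context.
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

monophones = ["K", "AE", "T"]          # context-independent units for "cat"
triphones = to_triphones(monophones)   # context-dependent units
print(triphones)
```

Each monophone maps to one model regardless of position, while the same base phoneme yields a different triphone label for every distinct pair of neighbors.
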
2.3 Multilingual Research

2.3.1 International Phonetic Alphabet. The International Phonetic Alphabet
(IPA) was developed in the late 1800s to create a separate symbol for each
contrastive sound occurring in human languages. Figure 2.1 shows the latest revision
of the IPA [17]. The symbols on the chart are modified Greek and Latin letters.
Between the main symbols and the diacritic marks (see the bottom of the chart), all
known sounds of the languages of the world can be represented. Looking at the
topmost table of consonants, one can see the “manner of articulation” in the rows and
the “place of articulation” in the columns. The grayed-out sections are judged to be
impossible for humans to produce. A closeup of the vowel section of the chart shows the
rows defining the position of the tongue at the roof of the mouth, while the columns
represent the position of the tongue at either the front or the back inside the mouth
cavity. The rest of the symbols on the chart are used to modify the base symbols
and hence create a huge range of sounds. If a Russian word is transcribed with the
phoneme /n/ and a French word is also transcribed with the phoneme /n/, it can
be said that the two words share the same basic sound. However, in reality there
can be variations among the same phoneme across languages, so the IPA is really a
categorical simplification of the phonetic context of languages which in truth can be
more continuous in nature.

2.3.2 Multilingual vs. Monolingual Speech Recognition. Much research has
been conducted on the topic of multilingual speech recognition [1,3-5,8,14,43-45,47].
This past research utilizes the similarity of sounds across languages to efficiently
build acoustic models that can recognize multiple languages. In all cases there is
a trade-off in performance for the flexibility of the multilingual acoustic models, but
there is also success in finding sounds across languages that have many
characteristics in common. In [26], cross-language approaches augment a new target language
with existing source language acoustic information, focusing on the adaptation and
transformation from source language to target language and resulting in word error
rate reductions. The approaches investigated in this research are drawn from this
background of incorporating multilingual speech data in various ways to derive
speech recognition systems for new languages. Instead of using the similarity of the
multiple languages’ phonemes to build a multilingual speech recognizer, the multilingual
data is used to help build a monolingual recognizer in a new and (theoretically)
data-sparse language.

Figure 2.1: IPA Chart [17] (The International Phonetic Alphabet, 2005 revision:
pulmonic, non-pulmonic, and co-articulated consonants; vowels; suprasegmentals;
tone; intonation; diacritics)

2.3.3 Data-Driven Approaches. To address issues of using categorically based
multilingual labeling standards (such as the IPA), data-driven (DD) approaches are
investigated so phoneme models across languages can be grouped based on their
acoustical properties regardless of phonetic category. In [21] a distance measure between
two phoneme models based on a relative entropy-based distance metric is proposed.
In [23] phoneme models are grouped based on IPA labels and a log-likelihood
approximated distance measure, and experiments are conducted on the SpeechDat database
that contains speech from multiple languages. Finally, in [42] a data-driven approach
to generate phonetic broad classes is taken using the phoneme confusion matrix to
derive the phonetic classes. Results from these papers lead to further investigation
into data-driven approaches discussed in this research.

2.4 Automatic Speech Recognition

Figure 2.2 shows an overview of a typical automatic speech recognition system. 
This section covers each block in the diagram and how each component works to 
create an ASR system. 

2.4.1 General Signal Processing. Speech digitally sampled at sixteen kHz
results in a usable frequency bandwidth of eight kHz, which covers the majority of
acoustic information carried by human speech. However, 16,000 samples per second
results in a large number of samples, so the signal is parameterized with a much lower
information rate. The signal processing front end extracts important information
contained in the speech signal and ideally is designed to show consistent patterns for
the same phonemes across speaker gender, age, accent, and dialect, and is channel
independent.

Figure 2.2: Block diagram of an automatic speech recognition system.

The first step in most speech processing systems pre-emphasizes the signal by
applying a first order difference equation

\[ s'(n) = s(n) - a\,s(n-1) \]

to the samples $s(n)$, $n = 1, \ldots, N$, in each window of samples. In the above equation
$a$ is the pre-emphasis coefficient, which normally lies in the range $0.9 < a < 1.0$. This
filtering process increases the energies of the high frequency spectrum to compensate
for the approximately -6 dB/octave spectral slope of the speech signal, which is mostly
attributable to the glottal source (i.e., the airflow through the vocal cords during
voiced speech) [16].
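
A minimal Python sketch of this pre-emphasis step follows. The default coefficient of 0.97 and the choice to pass the first sample through unchanged are assumptions made for illustration; the source only states that the coefficient lies between 0.9 and 1.0.

```python
def pre_emphasize(samples, a=0.97):
    """Apply the first-order difference: output(n) = s(n) - a * s(n - 1)."""
    if not samples:
        return []
    # The first sample has no predecessor, so it is kept as-is (assumption).
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - a * samples[n - 1])
    return out

# A constant (zero-frequency) signal is almost cancelled after the first sample,
# while a signal alternating at the highest frequency is boosted.
constant = pre_emphasize([1.0] * 5)
alternating = pre_emphasize([1.0, -1.0, 1.0, -1.0, 1.0])
print(constant)
print(alternating)
```

The two test signals show the high-pass behavior: low-frequency content is attenuated and high-frequency content is amplified.
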

The second step windows the data. To avoid end-point problems due to signal
truncation for a frame of speech, a weighting window is applied to smooth the
end-points. Typical window smoothing options are Hamming, Hanning, or Raised Cosine
windows.
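
The windowing step can be sketched as follows, using a Hamming window. The frame length of 400 samples (25 ms at 16 kHz) and the hop of 160 samples (10 ms) are typical values assumed here for illustration; the source does not state the frame sizes at this point.

```python
import math

def hamming(length):
    """Hamming window: 0.54 - 0.46 * cos(2 * pi * n / (length - 1))."""
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]

def frame_signal(samples, frame_len, hop):
    """Split a signal into overlapping frames and taper each with a Hamming window."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames

# 0.1 s of a constant signal at an assumed 16 kHz sampling rate.
signal = [1.0] * 1600
frames = frame_signal(signal, frame_len=400, hop=160)
print(len(frames), len(frames[0]))
```

Because the input is constant, each frame simply reproduces the window shape, which makes the tapering toward the frame edges easy to see.
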
2.4.2 Mel Frequency Cepstral Coefficients. Mel Frequency Cepstral
Coefficients (MFCCs) are the most common feature representation for speech processing.
MFCCs are a combination of filter-bank analysis and cepstral analysis. Filter-bank
analysis represents the signal spectrum by the log-energies at the output of a
filter-bank, where the filters are overlapping band-pass filters spread along the frequency
axis. This representation gives a rough representation of the signal spectral shape.
The center frequencies of the filters are spread evenly on the Mel scale, which takes
into account the relationship between frequency and “perceived” pitch and is related
to how the human ear operates. The Mel warping is approximated by

\[ f_{Mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right). \]
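
The Mel warping and its inverse can be written directly from the approximation above. This is a small Python sketch; the function names are chosen here for illustration.

```python
import math

def hz_to_mel(f_hz):
    """Mel warping: f_mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse of the Mel warping, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

for f in (0.0, 1000.0, 4000.0, 8000.0):
    print(f, round(hz_to_mel(f), 1))
```

The scale is roughly linear below 1 kHz, where 1000 Hz maps to about 1000 Mel, and compresses higher frequencies logarithmically.
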


The log-energy filter outputs with $P$ filters are

\[ e[j] = \log\left( \sum_{k=0}^{N-1} w_j[k]\,\left|S_{Mel}[k]\right| \right) \quad \text{for } j = 1, \ldots, P, \]

where $w_j[k]$ represents the weight of the $j$th filter at the $k$th discrete frequency of
the sampled signal $s(n)$ and $|S_{Mel}[k]|$ represents the DFT magnitude spectrum of
$s(n)$ warped onto the Mel frequency scale.
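
The filter-bank computation can be sketched in Python as below. This sketch assumes triangular filters and places the Mel spacing in the filter centers, applying them to a linear-frequency magnitude spectrum; that is a common equivalent formulation, not necessarily the exact arrangement used in the source. The filter shape, the small floor inside the logarithm, and all parameter values are assumptions for illustration.

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filter weights with centers evenly spaced on the Mel scale."""
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    # num_filters + 2 edge frequencies, evenly spaced in Mel, mapped back to Hz.
    edges = [mel_to_hz(max_mel * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = [k * nyquist / (num_bins - 1) for k in range(num_bins)]
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        weights = []
        for f in bin_hz:
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def log_energies(magnitude, bank, floor=1e-10):
    """e[j] = log(sum_k w_j[k] * |S[k]|), floored to avoid log(0)."""
    return [
        math.log(max(floor, sum(w * s for w, s in zip(weights, magnitude))))
        for weights in bank
    ]

bank = mel_filterbank(num_filters=10, num_bins=129, sample_rate=16000)
flat = log_energies([1.0] * 129, bank)   # flat magnitude spectrum as a toy input
print([round(e, 2) for e in flat])
```

With a flat spectrum the outputs grow with filter index, reflecting the wider bandwidth of the higher-frequency filters.
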

The “cepstrum” is defined as the inverse Fourier transform of the logarithm
of the Fourier transform. Cepstral coefficients represent the spectral envelope of the
speech. In practice, M cepstral coefficients are obtained by decorrelating the
filter-bank energies via a Discrete Cosine Transform (DCT)

\[ c_i = \sqrt{\frac{2}{P}} \sum_{j=1}^{P} e[j] \cos\left( \frac{\pi i}{P} (j - 0.5) \right) \quad \text{for } i = 1, \ldots, M. \]


By decorrelating the features, one can use diagonal-covariance HMMs,
which in turn reduces the computational requirements of the models.

The final step in computing the MFCCs is to subtract the mean from each
coefficient to account for amplitude differences; this process is referred to as Cepstral
Mean Subtraction (CMS).
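The Mel warping, the DCT, and CMS described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SONIC implementation: the function names are hypothetical, the filter count and frequency range passed in are placeholders, and the triangular filter shapes themselves are omitted.

```python
import math

def hz_to_mel(f_hz):
    """Mel warping: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

def filter_centers_hz(num_filters, f_low, f_high):
    """Center frequencies spaced evenly on the Mel scale (hypothetical helper)."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (num_filters + 1)
    return [mel_to_hz(lo + step * (j + 1)) for j in range(num_filters)]

def dct_cepstra(log_energies, num_ceps):
    """c_i = sqrt(2/P) * sum_j e[j] * cos(pi*i/P * (j - 0.5)), i = 1..M, j = 1..P."""
    P = len(log_energies)
    scale = math.sqrt(2.0 / P)
    return [
        scale * sum(e * math.cos(math.pi * i / P * (j - 0.5))
                    for j, e in enumerate(log_energies, start=1))
        for i in range(1, num_ceps + 1)
    ]

def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean taken over all frames of an utterance."""
    n = len(frames)
    means = [sum(col) / n for col in zip(*frames)]
    return [[c - m for c, m in zip(frame, means)] for frame in frames]
```

A flat set of log-energies carries no spectral shape, so `dct_cepstra([1.0] * 8, 4)` returns values that are zero up to rounding error.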

2.4.3 Pronunciation Lexicon or Dictionary. A pronunciation lexicon, sometimes
referred to as a dictionary, contains pronunciations for the words in the recognition
vocabulary. A pronunciation is a sequence of phonemes used to pronounce the
word. The vocabulary of the recognizer is defined by the LM, which is discussed in
the next section. The pronunciation lexicon may contain multiple pronunciations for
the same word. An example of a word with two pronunciations is

ACCIDENTAL: AE K S AX D EH N AX L 

ACCIDENTAL (2): AE K S AX D EH N T AX L 

2.4.4 Language Model. A statistical N-gram LM is used to predict word
selection for the recognizer output,

\[ P(w_1, w_2, \ldots, w_k) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) \cdots P(w_k \mid w_1, w_2, \ldots, w_{k-1}). \]


For the experiments the LM is a trigram model with back-offs built using the
CMU-Cambridge Language Modeling Toolkit [7]. “Trigram” means that word probabilities
are derived from the base word and the two-word preceding context:

\[ P(W) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2). \]

The term “back-off” refers to the language model backing off to either a bigram:

\[ P(W) = P(w_1) P(w_2 \mid w_1) \]

or a unigram:

\[ P(W) = P(w_1). \]

An example of a trigram is the probability of the word “recognition” given that the
words “automatic” and “speech” precede it. If a trigram does not occur in the training
text, the LM reverts (or backs off) to bigrams or unigrams. An example of a bigram is
the probability of the word “recognition” given that the word “speech” precedes it. An
example of a unigram is the probability of the word “recognition” in the training
text [19].
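The back-off idea can be illustrated with raw counts. The sketch below is a deliberate simplification and an assumption on our part: it falls back to lower-order maximum-likelihood estimates without the discounting and back-off weights that a real toolkit applies, and the function names and toy corpus are hypothetical.

```python
from collections import Counter

def train_counts(sentences):
    """Count unigrams, bigrams, and trigrams in tokenized sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for words in sentences:
        uni.update(words)
        bi.update(zip(words, words[1:]))
        tri.update(zip(words, words[1:], words[2:]))
    return uni, bi, tri

def backoff_prob(w1, w2, w3, uni, bi, tri):
    """Return (estimate of P(w3 | w1, w2), N-gram order actually used)."""
    if tri[(w1, w2, w3)] > 0:                    # trigram was seen in training
        return tri[(w1, w2, w3)] / bi[(w1, w2)], 3
    if bi[(w2, w3)] > 0:                         # back off to the bigram
        return bi[(w2, w3)] / uni[w2], 2
    return uni[w3] / sum(uni.values()), 1        # back off to the unigram

corpus = [
    ["automatic", "speech", "recognition"],
    ["speech", "recognition", "works"],
    ["speech", "synthesis"],
]
uni, bi, tri = train_counts(corpus)
print(backoff_prob("automatic", "speech", "recognition", uni, bi, tri))
print(backoff_prob("robust", "speech", "recognition", uni, bi, tri))
print(backoff_prob("robust", "noisy", "recognition", uni, bi, tri))
```

The three calls use the trigram, the bigram, and the unigram estimate respectively, because the contexts "robust speech" and "noisy" never occur in the toy corpus.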

2.5 Hidden Markov Models 

A brief description of Markov processes is presented in this section, followed
by descriptions of the training and decoding algorithms for Hidden Markov Models
(HMMs). Further details are in [20,25,31,32].

A discrete Markov process has a finite number of states, N, which form a Markov
chain. At discrete time intervals the system may undergo state changes according to
a set of transition probabilities, where a transition into the same state is allowed. In
an n-th order Markov process, the current state at time t depends on its n previous
states.

Each state of the Markov chain has an associated random output function.
These random functions are known as observation distributions or emission
distributions and typically are chosen to be Gaussian Mixture Models (GMMs) for ASR. At
a discrete time instance, t, the Markov process is assumed to be in state $s_t$, and an
observation $o_t$ is generated according to the emission probability associated with the
state. The system may generate a state change or it may stay in the current state at
the next time instance. Thus, a Markov model generates a set of observations
according to a set of transition probabilities and output distributions. If the state
sequence of the underlying Markov chain is unknown or hidden from the observer, this is
called an HMM. In speech recognition the state sequence is not directly observed, so HMMs
are appropriate. Part of the task of speech recognition is estimation of the underlying
state sequence.

Figure 2.3: A typical left-to-right HMM. The model is entered from the
left with an initial transition probability of entering the model in state 1 ($s_1$). The
transition probabilities are shown as $a_{ij}$, where i is the current state and j is the
next state. The output distributions $b_j$ are modeled as Gaussians and output an
observation $o_j$ when state j is entered.

There is no limit to the order of the Markov chain, but for speech recognition 
restriction to a first-order, left-to-right Markov process typically is made. In first-order 
processes the current state depends only on the immediately preceding state and no 
other history. This assumption is invalid for speech recognition, but it drastically 
reduces the complexity of the model and has been shown to give useful results. 

Figure 2.3 shows a typical three-state left-to-right HMM. The model is entered
from the left with an initial transition probability of entering the model in state 1
($s_1$). The transition probabilities are shown as $a_{ij}$, where i is the current state
and j is the next state. The transition probabilities of the HMMs represent the statistical
duration information of the phonemes. The output distributions $b_j$ are modeled as
GMMs and output an observation $o_j$ when state j is entered.

2.5.1 HMM Training. For generative models the model $\lambda$ that most likely
generated an observed sequence of observations $O = o_1, \ldots, o_T$ must be determined.

An iterative process known as the forward-backward algorithm is used. The forward
variable $\alpha_t(i)$ is

\[ \alpha_t(i) = p(o_1, \ldots, o_t, s_t = i \mid \lambda). \tag{2.1} \]

Equation 2.1 gives the probability of observing the partial sequence of observations
$o_1, \ldots, o_t$ up to time t and being in state i at time t given model $\lambda$. For an
N-state model, initially

\[ \alpha_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N, \]

where $\pi_i$ is the probability of starting in state i at time t = 1. For the left-to-right
model, $\pi_1 = 1$. By iterating and summing, a trellis of forward probabilities is
generated for state j at time t+1:

\[ \alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \right] b_j(o_{t+1}), \]

where $1 \le t \le T-1$ and $1 \le j \le N$. The forward probability that model $\lambda$
generated the full observation sequence O is

\[ P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i). \]

A backward variable $\beta_t(i)$ is defined in a similar fashion,

\[ \beta_t(i) = p(o_{t+1}, \ldots, o_T \mid s_t = i, \lambda), \]

or, recursively,

\[ \beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(o_{t+1}) \beta_{t+1}(j). \]
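The forward and backward recursions can be checked against each other on a toy model. The sketch below is illustrative only: the two-state left-to-right model and its emission likelihoods are invented numbers, the backward pass is initialized with the standard convention $\beta_T(i) = 1$ (an assumption not stated in the text), and plain probabilities are used where a practical system would use scaling or logarithms.

```python
def forward(pi, a, b_obs):
    """alpha[t][i] for all t, where b_obs[t][i] holds b_i(o_t)."""
    N, T = len(pi), len(b_obs)
    alpha = [[pi[i] * b_obs[0][i] for i in range(N)]]
    for t in range(1, T):
        alpha.append([
            sum(alpha[t - 1][i] * a[i][j] for i in range(N)) * b_obs[t][j]
            for j in range(N)
        ])
    return alpha

def backward(a, b_obs):
    """beta[t][i] for all t, initialized with beta at the final frame equal to 1."""
    N, T = len(a), len(b_obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(a[i][j] * b_obs[t + 1][j] * beta[t + 1][j]
                             for j in range(N))
    return beta

# Toy two-state left-to-right model (all values are made up for illustration).
pi = [1.0, 0.0]
a = [[0.6, 0.4],
     [0.0, 1.0]]
b_obs = [[0.5, 0.1],   # b_i(o_1)
         [0.4, 0.3],   # b_i(o_2)
         [0.1, 0.7]]   # b_i(o_3)

alpha = forward(pi, a, b_obs)
beta = backward(a, b_obs)
print(sum(alpha[-1]))                                   # P(O | lambda)
print(sum(x * y for x, y in zip(alpha[1], beta[1])))    # same value at t = 2
```

Summing $\alpha_t(i)\beta_t(i)$ over the states gives the same likelihood at every frame, which is a convenient sanity check on an implementation.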

2.5.2 Baum-Welch Re-estimation. A procedure is needed for adjusting the
parameters of a model given the training data. An iterative method known as the
Baum-Welch algorithm is commonly used for this purpose. The re-estimation formulas
of the Baum-Welch algorithm provide a method of recomputing the transition and
emission probabilities of an HMM using the forward and backward probability
equations shown in the previous section. Every iteration of the Baum-Welch re-estimation
is guaranteed to increase the total likelihood of the model generating the observation
sequence, $p(O \mid \lambda)$ (if using Gaussian mixture output distributions), unless a
maximum is reached, at which point the likelihood remains constant. The proof of this
property is in [25].

2.5.3 Viterbi Algorithm and Decoding. For decoding purposes, given an
observed acoustic sequence it is necessary to find the maximum likelihood path (i.e.,
the best state sequence) through a composite model, which is a concatenation of
phoneme models (HMMs) into a larger model (HMM network) constrained by the
lexicon (dictionary) and the LM. Due to the high computational load of the forward
probability calculations, a dynamic programming algorithm called the Viterbi
algorithm [11] is typically used to find the best state sequence. The likelihood is
computed as for the forward probabilities, except that the summation is replaced by a
maximization,

\[ \phi_t(j) = \max_i \{ \phi_{t-1}(i) a_{ij} \} \, b_j(o_t), \]

where $\phi_t(j)$ is the highest score along a path at time t, which accounts for the
first t observations and ends in state j. The maximum likelihood approximation of
$p(O \mid \lambda)$ is then given by $\max_i \{\phi_T(i)\}$ for an N-state model. A trace
back through the trellis created by the Viterbi algorithm, selecting the state j with the
highest $\phi_t(j)$ at each time t, yields the best state sequence (in reverse order).
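A compact version of the Viterbi recursion is sketched below. It is an illustration under assumptions, not SONIC's decoder: the toy model values are invented, the traceback follows stored back-pointers (a common way to realize the trace back described above), and plain probabilities are used where a practical decoder would work with log scores.

```python
def viterbi(pi, a, b_obs):
    """Return (best path score, best state sequence); b_obs[t][i] holds b_i(o_t)."""
    N, T = len(pi), len(b_obs)
    phi = [[pi[i] * b_obs[0][i] for i in range(N)]]
    back = [[0] * N]                      # back[t][j]: best predecessor of j at t
    for t in range(1, T):
        row, ptr = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: phi[t - 1][i] * a[i][j])
            row.append(phi[t - 1][best_i] * a[best_i][j] * b_obs[t][j])
            ptr.append(best_i)
        phi.append(row)
        back.append(ptr)
    state = max(range(N), key=lambda i: phi[-1][i])
    best_score = phi[-1][state]
    path = [state]
    for t in range(T - 1, 0, -1):         # trace back, collecting states in reverse
        state = back[t][state]
        path.append(state)
    path.reverse()
    return best_score, path

# Toy two-state left-to-right model (values invented for illustration).
pi = [1.0, 0.0]
a = [[0.6, 0.4],
     [0.0, 1.0]]
b_obs = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.7]]
score, path = viterbi(pi, a, b_obs)
print(score, path)
```

Because only the single best path is scored, the Viterbi score is never larger than the full forward probability computed for the same model and observations.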

2.6 The SONIC Speech Recognition System

SONIC is the University of Colorado’s Continuous Speech Recognizer. It is
designed for research and development of new algorithms for continuous speech
recognition. The rest of this section describes in detail each of the components in SONIC.
All experiments discussed here were run using SONIC 2.0 beta5 [28].

2.6.1 Speech Detection and Feature Representation. Speech detection is
sometimes referred to as Speech Activity Detection (SAD) and is a process that
marks regions of a speech utterance according to whether they are speech or
non-speech (silence, background noise, cough, etc.). Non-speech segments are ignored for
recognition. Within SONIC, the SAD is built around a two-state HMM with one
state representing speech and the other state representing non-speech. These HMMs
are pre-trained and are held constant through all experiments. For both SAD and
ASR, SONIC parameterizes the speech into a feature vector.

SONIC computes a 39-dimensional feature vector consisting of 12 MFCCs and 
the normalized frame energy along with the first- and second-order derivatives of the 
features for both modeling and decoding stages. The feature vector is calculated 
every 10 ms from a sliding window of 20 ms of audio. A block diagram of the feature 
extraction process is shown in Figure 2.4. In general, the process involves: 

• Pre-emphasize the signal by a factor of 0.97 

• Window 20ms of speech with a raised cosine window 

• Compute the FFT 

• Warp the frequencies to the Mel scale 

• Perform filter-bank analysis (Log of output filter energies) 

• Compute the Discrete Cosine Transform (decorrelate features) 

• Subtract the Cepstral mean (reduces channel mismatch) 

• Compute signal energy of the 20ms speech window 

• Compute the first and second order derivatives of all features 
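The framing and derivative steps in the list above can be sketched as follows. This is an illustration under assumptions rather than SONIC's front end: the pre-emphasis is written as the standard first-order filter y[n] = x[n] - 0.97 x[n-1], the raised cosine window is taken to be a Hann window, the derivatives use a simple central difference, and all function names are hypothetical.

```python
import math

def pre_emphasize(signal, coeff=0.97):
    """First-order high-pass: y[n] = x[n] - coeff * x[n-1] (first sample kept)."""
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, sample_rate, win_ms=20, shift_ms=10):
    """Cut the signal into 20 ms raised-cosine-weighted frames every 10 ms."""
    win = sample_rate * win_ms // 1000
    shift = sample_rate * shift_ms // 1000
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (win - 1)) for n in range(win)]
    return [[signal[start + n] * window[n] for n in range(win)]
            for start in range(0, len(signal) - win + 1, shift)]

def deltas(features):
    """Central-difference time derivative; edge frames reuse their neighbor."""
    T = len(features)
    out = []
    for t in range(T):
        prev = features[max(t - 1, 0)]
        nxt = features[min(t + 1, T - 1)]
        out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return out

def add_derivatives(static):
    """Append first- and second-order derivatives: 13 static values become 39."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# One second of silence at 16 kHz and placeholder 13-dimensional static features.
frames = frame_signal([0.0] * 16000, 16000)
static = [[float(t)] * 13 for t in range(len(frames))]
features = add_derivatives(static)
print(len(frames), len(frames[0]), len(features[0]))
```

With 12 MFCCs plus the normalized frame energy as the 13 static values, appending both derivative sets yields the 39-dimensional vector described in the text.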

2.6.2 Acoustic Model. The AM contains all information related to phonetics
and channel condition (telephone, microphone, etc.). Each phoneme is represented
by a three-state continuous-density HMM. The models are continuous because of
the underlying GMMs for each state. The three states of each HMM represent the
beginning, middle, and end of each phoneme. Figure 2.5 shows an example of an
HMM network for the word “one.” Both male and female genders have their own
set of phoneme models. The phoneme HMMs can be either monophone or triphone
models. State transitions in SONIC are modeled using a two-parameter gamma
distribution [30] rather than the typical HMM state transition probabilities, which by
default imply a geometric state-duration distribution. As reported in [2,18,33], using
explicit state duration models with HMMs improves recognition accuracy. In [30], it is
shown that a gamma distribution fitted to the measured phoneme duration distribution
outperforms the standard method of using HMM transition probabilities.

Figure 2.4: Process of calculating Mel Frequency Cepstral Coefficients (MFCCs).

Figure 2.5: Example HMM sequence for the word “one.” Each phoneme HMM
has three states representing the beginning, middle, and end of the phoneme. Here
$O = o_1, \ldots, o_{18}$ for 18 output observation vectors.

Each HMM state within SONIC is represented by a mixture of M Gaussian
distributions. In all experiments, the number of Gaussian distributions is fixed at
32 per HMM state. Each 39-dimensional Gaussian is represented by a weight $w_m$, a
mean vector $\mu_m$, and a covariance matrix. After processing the feature vector through
the DCT, a diagonal covariance matrix $\sigma_m^2$ is assumed. The likelihood calculation is

\[ p(o_t \mid \lambda) = \sum_{m=1}^{M} \frac{w_m}{(2\pi)^{D/2} \sqrt{\prod_{d=1}^{D} \sigma_m^2[d]}} \exp\left( -\frac{1}{2} \sum_{d=1}^{D} \frac{(o_t[d] - \mu_m[d])^2}{\sigma_m^2[d]} \right), \]

where $p(o_t \mid \lambda)$ is the likelihood of observing the t-th feature vector $o_t$ given
the model $\lambda$. As discussed, D is 39 in these experiments. Also, the weights sum to 1:

\[ \sum_{m=1}^{M} w_m = 1. \]
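The diagonal-covariance mixture likelihood can be evaluated dimension by dimension, which is the computational saving that the DCT decorrelation buys. The sketch below assumes the standard diagonal Gaussian density; the function name and the toy parameters are hypothetical, and a practical implementation would accumulate log-likelihoods to avoid underflow in 39 dimensions.

```python
import math

def diag_gmm_likelihood(o, weights, means, variances):
    """Likelihood of feature vector o under a diagonal-covariance GMM."""
    D = len(o)
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        # Log of the Gaussian normalization term for a diagonal covariance.
        log_norm = -0.5 * (D * math.log(2.0 * math.pi) + sum(math.log(v) for v in var))
        # Log of the exponential term, summed over the D dimensions.
        log_exp = -0.5 * sum((x - m) ** 2 / v for x, m, v in zip(o, mu, var))
        total += w * math.exp(log_norm + log_exp)
    return total

# Two-component, two-dimensional toy mixture (values invented for illustration).
weights = [0.4, 0.6]
means = [[0.0, 0.0], [1.0, -1.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
print(diag_gmm_likelihood([0.2, -0.3], weights, means, variances))
```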


SONIC actually uses a Viterbi-based training algorithm. During model training,
the frames of training data for each base phoneme are clustered to the HMM states
by an Expectation-Maximization (EM) algorithm. The trainer code estimates HMM
parameters in the maximum likelihood sense one base phoneme at a time using 10 EM
iterations [28]. This method is not as thorough as the forward-backward algorithm
discussed in Section 2.5.1, but Viterbi-based training is fast and efficient.

The training frames for each phoneme are defined by a Viterbi-based alignment
process. Recall that the reference transcripts only exist at the utterance level; SONIC
uses the initial AM, the dictionary, and the transcripts to align the phoneme labels
with the audio data. This alignment process is repeated at each iteration (of either
bootstrapping or adaptation techniques) with the updated AMs.

2.6.3 Monophone Acoustic Models. Monophone AMs, also known as
context-independent models, model all instances of a given phoneme. The resulting
monophone model is a three-state, 32-mixture HMM trained on all available frames of data
for the particular phoneme being modeled.

2.6.4 Triphone Acoustic Models. Triphone AMs, also known as
context-dependent models, are clustered depending on the phoneme immediately to the left
and right of the phoneme being modeled. The resulting triphone model consists of
multiple three-state, 32-mixture HMMs, where each HMM has a list of phonemes for
the left context and a list of phonemes for the right context. The training data for
each HMM consists of only the training frames that fit the context.

An example of the difference between monophone and triphone AMs is as
follows. Given the following three words,

cat: K AE T
hat: H AE T
scat: S K AE T

build a monophone and triphone AM for the phoneme /AE/. The monophone AM
for /AE/ is trained on all feature vectors within the start and end times of the
phoneme /AE/ for the three words, and results in a single three-state, 32-mixture
HMM modeling /AE/. The triphone AM contains two three-state, 32-mixture HMMs
with one HMM trained on the feature vectors of /AE/ that fall between /K/ and /T/,
and another HMM trained on the feature vectors of /AE/ that fall between /H/ and
/T/.






Figure 2.6: Example decision tree for the base phoneme /AA/ [28]. 

For a language that has 50 phonemes, there are $50^3$ = 125,000 possible triphone
units. However, not every context must appear, and sometimes there is not enough training
data to properly model certain contexts. Typically, an automatic clustering method is
used to reduce a triphone model to 5000-6000 triphone units. SONIC uses a decision
tree method to cluster contexts with similar acoustic characteristics. A binary
decision tree is automatically derived based on certain rules such as which phonemes
are voiced, which are nasals, etc. At each node of the tree these questions are asked,
and the splitting continues until either the change in likelihood due to splitting is
below a pre-determined threshold or until the number of frames assigned to the leaf
node falls below a pre-determined minimum frame count. Feature vectors assigned to
the leaf nodes are then used to estimate the actual HMM parameters. An example
decision tree is shown in Figure 2.6. A theoretical discussion of decision tree methods
for training acoustic models is in [24,29].
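The cat/hat/scat example and the triphone count can be reproduced with a short script. It is only a sketch: the function name is hypothetical, and padding word edges with a "SIL" symbol is our assumption for handling phonemes at word boundaries, not something the text specifies.

```python
from collections import defaultdict

def triphone_contexts(lexicon, base):
    """Group the words containing `base` by its (left, right) phoneme context."""
    contexts = defaultdict(list)
    for word, phones in lexicon.items():
        padded = ["SIL"] + phones + ["SIL"]   # assumed word-boundary symbol
        for left, mid, right in zip(padded, padded[1:], padded[2:]):
            if mid == base:
                contexts[(left, right)].append(word)
    return dict(contexts)

lexicon = {
    "cat": ["K", "AE", "T"],
    "hat": ["H", "AE", "T"],
    "scat": ["S", "K", "AE", "T"],
}
print(triphone_contexts(lexicon, "AE"))
print(50 ** 3)   # upper bound on triphone units for a 50-phoneme language
```

The /AE/ frames from "cat" and "scat" share the K_T context and would train one HMM, while "hat" supplies the H_T context for the other; a monophone model would pool all three.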

2.6.5 Model Adaptation. Model adaptation techniques modify the model
parameters based on new data without discarding previously trained models. Two
reasons to adapt models are highlighted here. First, adapting speaker-independent
models to gender-dependent models has been shown to improve recognition
performance. Second, language adaptation may be realized, where, for example, a
multilingual model is adapted to a monolingual model. Depending on how the adaptation is
implemented, the AMs can adapt too quickly to the new data (similar to disregarding
the previous model), or not adapt quickly enough and thus never adequately model
the new data.

During the normal process of building AMs in SONIC, a gender-independent
AM is built using all the alignments of the training data. Gender-dependent AMs are
then adapted from this gender-independent AM by updating the weights and means
of the HMMs using the gender-specific training data. If the gender of the test file
is known (which can be found by manual or automatic means), the corresponding
gender-specific AM is used for the decoding of the audio, and this procedure has
been shown to yield improved performance compared to a gender-independent AM.
Gender-dependent AMs are used for all the experiments discussed in this research.

The Structural Maximum a Posteriori Linear Regression (SMAPLR) method
described by Siohan et al. [40,41] is used to adapt the initial AM to the target language
(TL) AM. SMAPLR, like Maximum Likelihood Linear Regression (MLLR) [12],
estimates a set of regression class transformations to maximize the likelihood of the
adaptation data against the existing HMM model. SMAPLR and MLLR differ in
that, for SMAPLR, the number of regression transforms is determined by the amount of
data and the phonetic content of the adaptation data, whereas for MLLR the
number of regression transforms is determined by the user. The regression class
transforms in the implementation of SMAPLR in SONIC are used to transform the
Gaussian mean parameters and to adjust the variances of the Gaussian distributions.

2.6.6 Porting SONIC to Other Languages. Figure 2.7 shows an example of
the steps needed to bootstrap a SONIC ASR system from one language to another.
The first step uses an initial AM from a large resource language, such as English.
The second piece of information needed for bootstrapping is a Phoneme Map File.
The Phoneme Map File is used to map TL phonemes to the nearest initial language
phonemes so that the AMs of the initial language can be used to create an initial
alignment of the TL speech. The initial alignments are used to build the initial TL
AMs, which are then used to realign the text in an iterative fashion. Mappings are
typically created manually based on IPA. Tables 3.1 and 3.2 show the mappings used
in this research for several TLs to English. It is allowable to have a single initial
language phoneme map to multiple TL phonemes (as in /a/ and /al/ in German both
mapping to English /AA/). After the first alignment of the acoustic data, the TL
phoneme models are all distinct. Once these two components are in place, the TL
dictionary is used in conjunction with the Viterbi algorithm to time align the audio
data with the transcripts. These alignments are then used to train the TL AM for
iteration zero. With these new models the Viterbi alignment is repeated using the
same TL dictionary, audio data, and transcript files as before. After each iteration of
these steps, the previous TL AM is discarded and the new TL AM is trained with the
updated alignments. After three iterations of this process, the TL AM is considered
final, and this is the model used for decoding the test set data, which results in
Phoneme Error Rate (PER) and Word Error Rate (WER) values.

Figure 2.7: Block diagram of bootstrapping an ASR system from an initial language
to a target language (TL). Here the initial AM is assumed to be English, hence the
phoneme map file contains mappings from the TL to English. The TL dictionary,
audio, and text are needed for each training iteration. After each iteration of training
the new AM is used to re-align the TL audio to create the next (better) iteration of
bootstrapping.

III. Experimental Results


3.1 Language Inventory

The GlobalPhone Database consists of 15 different languages of high quality read
speech data with the source text being news websites [34]. The number of speakers
varies with language, and the amount of speech data varies from 16 to 33 hours per
language. Because the audio data is read from on-line texts, the transcriptions are
extracted from the original text. Transcripts are available at the utterance level only.
Word and phoneme alignments must be derived by automatic means.

3.1.1 Details of Languages used in this Research. A subset of the GlobalPhone
languages is chosen for this research to include Croatian, German, Japanese,
Turkish, and Arabic. For each of these languages an ASR system is ported using
SONIC according to the method of Section 2.6.6 starting with English AMs, but the
Arabic partition is the focus for multilingual porting methods. Narrowing the
experiments to Arabic allows five diverse languages to be used for training purposes,
namely Croatian, English, German, Japanese, and Turkish. This subset of languages
is chosen from GlobalPhone as being somewhat diverse and yet with good (but not
total) coverage of all the phonemes of Arabic. Another reason for choosing this subset
of languages (with the exception of English, which is not included in GlobalPhone)
is that each of these languages has a pronunciation dictionary in the IPA notation,
which allows the proposed multilingual experiments.

3.1.2 Description of the Arabic partition of GlobalPhone. The Arabic training
data consists of 14.5 hours of speech data spread across 68 different speakers (39
female, 29 male). The evaluation set has 2 hours of speech data (589 utterances)
spoken by five different speakers (2 female, 3 male). The dictionary and transcripts
are in a Romanized format. An example is below.

Example transcript: KaARiThTei AL-BuWYiNGh NYuWYuWRK
Translation: “Collapse of the Boeing in New York” (a headline phrase).

Example dictionary entries:

KaARiThTei: k a al r i T t i
AL-BuWYiNGh: al l b ul j i n rr
NYuWYuWRK: n j ul j ul r k

The pronunciation dictionary contains all words in both training and evaluation
sets (i.e., there are no out-of-vocabulary (OOV) instances). The dictionary released with
GlobalPhone does not account for all the words in the transcripts, so an automatic
letter-to-sound rule system is trained according to the procedures of [28] to account
for the missing words. Some pronunciations are still missing after this step, so an
in-house language expert manually updates such pronunciations.

3.2 Multilingual Phoneme Set 

Each language of the GlobalPhone Database has a dictionary, or pronunciation
lexicon, for each of the words of the utterances covered by that section of the database.
The dictionaries include an ASCII representation of each phoneme for each language.
All of the languages chosen for this research have the pronunciations listed in a
multilingual phoneme representation. A list of these phonemes is in Tables 3.1 and 3.2.
Each row signifies a multilingual phoneme as listed in the GlobalPhone dictionaries.
The last column “ENG” lists a “close” English phoneme mapping in terms of the
English phoneme set used by SONIC. See Appendix B, which is taken from [28], for
examples of the sounds these English phonemes represent. There are some rows where the
only multilingual phoneme listed falls under the Arabic column, which means that
phoneme is not represented in the four non-English languages used as training data.

Croatian   German   Japanese   Turkish   Arabic   English
m          m        m          m         m        M
[remaining rows of Table 3.1 are not recoverable from the source]

Table 3.1: Multilingual phonemes as listed in the GlobalPhone dictionaries for
Croatian, German, Japanese, Turkish, and Arabic. The phonemes listed for English
are from the phoneme set used in the SONIC ASR system.

Croatian   German   Japanese   Turkish   Arabic   English
z          z        z          z         z        Z
[remaining rows of Table 3.2 are not recoverable from the source]

Table 3.2: Continuation of the Multilingual phonemes as listed in the GlobalPhone
dictionaries for Croatian, German, Japanese, Turkish, and Arabic. The phonemes
listed for English are from the phoneme set used in the SONIC ASR system.

3.3 Performance Metrics

Various performance metrics are used to compare the various acoustic modeling
approaches tested. Word and phoneme error rates are the most widely used metrics
to compare ASR system performance. However, for phoneme recognition, overall
error rates collapse the performance of all the phonemes into one value, sometimes
masking individual gains or degradations. To provide a more in-depth look into what
the updated AMs contribute towards system performance, the Phoneme Confusion
Matrix is also used.

3.3.1 Phoneme Error Rate - Equally Likely Phonemes. PER using an
equally-likely phoneme “language model” (PER-ELP) means that the system
performance is solely due to the AMs. As seen previously in Figure 2.2, the three
components needed for the Pattern Matching Block are the LM, AM, and Dictionary.
The LM for ELP has equal probability that any phoneme can occur after any other
phoneme. Further, the dictionary does not contain words, but is merely a list of
phonemes for that language. With these settings, the recognizer output is a stream of
phonemes, where the recognizer decision is based solely on the frames of features every
10 ms and how closely they match the AMs. This output stream is then compared to
the reference phoneme stream (derived from aligning the reference transcripts). The
PER takes into account insertion, deletion, and substitution errors. The following
example highlights each of these errors:



             The        dog            jumped
Reference:   DH AX      D AO GD        JH AH M PD TD
ASR output:  DH         D AO G         JH AH M PD TD EH
Error type:  deletion   substitution   insertion
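The error counts in this example can be reproduced with a minimum-edit-distance alignment. The sketch below assumes the usual definition of the error rate, (substitutions + deletions + insertions) divided by the number of reference phonemes; the function names are hypothetical and this is not the scoring tool used in the experiments.

```python
def align_errors(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    R, H = len(ref), len(hyp)
    # cost[i][j] = (total errors, subs, dels, ins) for ref[:i] versus hyp[:j]
    cost = [[None] * (H + 1) for _ in range(R + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        cost[i][0] = (i, 0, i, 0)          # everything deleted
    for j in range(1, H + 1):
        cost[0][j] = (j, 0, 0, j)          # everything inserted
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            e, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, n)                 # match, no error
            else:
                diag = (e + 1, s + 1, d, n)         # substitution
            e, s, d, n = cost[i - 1][j]
            dele = (e + 1, s, d + 1, n)             # deletion
            e, s, d, n = cost[i][j - 1]
            ins = (e + 1, s, d, n + 1)              # insertion
            cost[i][j] = min(diag, dele, ins)       # fewest total errors wins
    _, subs, dels, inss = cost[R][H]
    return subs, dels, inss

def error_rate(ref, hyp):
    """Error rate in percent relative to the reference length."""
    subs, dels, inss = align_errors(ref, hyp)
    return 100.0 * (subs + dels + inss) / len(ref)

ref = "DH AX D AO GD JH AH M PD TD".split()
hyp = "DH D AO G JH AH M PD TD EH".split()
print(align_errors(ref, hyp), error_rate(ref, hyp))
```

For the example above the alignment finds one substitution, one deletion, and one insertion against ten reference phonemes, a PER of 30%. The same routine applied to word sequences gives a WER.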


3.3.2 Word Error Rate. WER is another metric used to rate ASR systems.
To compute a WER, the ASR system being evaluated has the same AMs as are used
to compute the PER, but the LM is made up of word trigrams and the dictionary has
the pronunciation for each of the words in the LM. The recognizer output is based
not only on the AM information, but also on the statistics of the LM. A well-built
LM can mask flaws in poorly trained AMs and vice versa. A problem with WER is
the need for exactness. If the reference transcript contains the word “computer” but
the recognizer outputs the word “compute,” an error is listed even though it is wrong
by only one phoneme.

3.3.3 Phoneme Error Rate - Word Language Model. A second form of
PER allows the use of a trained word LM (PER-WLM) as opposed to an
equally-likely phoneme LM. The LM is the same one used when running the recognizer to
output word strings. In other words, the recognizer determines the best word given a set
of acoustical feature frames and the LM, and that word, along with its pronunciation,
is output. In the previous example of the words “computer” and “compute”, one
deletion error is returned for the missing phoneme “r”.

3.3.4 Phoneme Confusion Matrix. To fully analyze speech recognizer
performance, a phoneme confusion matrix is used to display which phonemes an ASR
system outputs correctly, and also which phonemes an ASR system confuses with
other phonemes. A phoneme confusion matrix labels each row and each column with
every valid phoneme for the ASR system employed. The rows represent the actual
phonemes (truth) and the columns represent the output phonemes (recognizer
output). Therefore, the diagonal of the matrix shows the number of times the recognition
output matched the truth for a given phoneme. Reading across a row for a given
phoneme, each other cell represents the number of times the recognizer output that
wrong phoneme. High values off the diagonal represent highly confusable phonemes
and show where work should be focused to improve recognition performance.
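The row and column convention described above can be made concrete with a small sketch. The phoneme set and the aligned (truth, output) pairs below are invented for illustration, the function names are hypothetical, and only correct and substituted phonemes are counted; insertions and deletions would need extra bookkeeping.

```python
def confusion_matrix(phonemes, aligned_pairs):
    """Rows are truth phonemes, columns are recognizer output phonemes."""
    index = {p: k for k, p in enumerate(phonemes)}
    matrix = [[0] * len(phonemes) for _ in phonemes]
    for truth, output in aligned_pairs:
        matrix[index[truth]][index[output]] += 1
    return matrix

def most_confused(phonemes, matrix):
    """Return (truth, output, count) for the largest off-diagonal cell."""
    best = None
    for r, row in enumerate(matrix):
        for c, count in enumerate(row):
            if r != c and (best is None or count > best[2]):
                best = (phonemes[r], phonemes[c], count)
    return best

phonemes = ["AA", "AE", "S"]
pairs = [("AA", "AA"), ("AA", "AE"), ("AA", "AE"),
         ("AE", "AE"), ("S", "S"), ("AE", "AA")]
matrix = confusion_matrix(phonemes, pairs)
for label, row in zip(phonemes, matrix):
    print(label, row)
print(most_confused(phonemes, matrix))
```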

Usually, phonemes are grouped into categories such as vowels, nasals, fricatives,
etc., so that trends can be analyzed in the errors. One expects a long “a” and a short
“a” to show some confusability. One does not expect a long “a” and an “s” to be
confused often. If this second trend occurs, the phoneme confusion matrix would be
the tool used to focus on these errors and narrow system tuning. See Appendix A for
phoneme confusion matrices.

3.4 Bootstrapping from English Results

A preliminary set of experiments was conducted using the bootstrapping procedure
outlined in Section 2.6.6, starting with English AMs, to build phoneme and word
recognizers for different languages. The English AMs are trained on the Wall Street
Journal Database [13], which consists of 73 hours of English audio data. The first step
of the bootstrapping aligns the new language transcriptions to the audio data using
the pronunciation dictionary for that language. In this step the English AMs and a
Phoneme Map File that assigns the closest English phoneme to each phoneme of the
new language (see Tables 3.1 and 3.2 for these mappings) are used for the alignment.
Once this initial alignment is completed for each training file, the alignments are used
to train three-state, 32-mixture HMMs for each phoneme of the new language. This
AM is then used to realign the training files, and these new alignments are used to
train new AMs. This process is iterated three times (see Figure 2.7), and the resulting
AMs are used for decoding the test data.
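The control flow of this bootstrapping procedure can be sketched as a loop. Everything here is a hypothetical skeleton: `align` and `train` stand in for the recognizer's alignment and training tools and are passed in as callables, the phoneme symbols in the map are placeholders, and counting the three iterations after an iteration-zero model follows Section 2.6.6 as we read it.

```python
def map_pronunciation(tl_phones, phoneme_map):
    """Replace each TL phoneme by its nearest initial-language phoneme."""
    return [phoneme_map[p] for p in tl_phones]

def bootstrap(initial_am, phoneme_map, align, train, iterations=3):
    """Return (final TL acoustic model, list of all TL models built)."""
    # Initial alignment: initial-language AM plus the phoneme map.
    alignments = align(initial_am, phoneme_map)
    am = train(alignments)                 # TL AM for iteration zero
    history = [am]
    for _ in range(iterations):
        alignments = align(am, None)       # TL AM needs no phoneme map
        am = train(alignments)             # previous TL AM is discarded
        history.append(am)
    return am, history

# Toy stand-ins so the skeleton runs: a "model" is just a generation counter.
def toy_align(am, phoneme_map):
    return am

def toy_train(alignments):
    return alignments + 1

phoneme_map = {"a": "AA", "al": "AA"}      # two TL phonemes sharing one source phoneme
print(map_pronunciation(["a", "al"], phoneme_map))
print(bootstrap(0, phoneme_map, toy_align, toy_train))
```

Passing the tools in as callables keeps the loop independent of any particular recognizer, which is convenient when comparing different starting models.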

Bootstrapping has been successfully used by others to build ASR systems in 
new languages, see [22,35-38,46]. Various methods of bootstrapping exist, especially 
in how model parameters are updated to build the TL acoustic models and to what 
extent model parameter sharing is used (if any) between the training language acoustic 
models and the TL acoustic models. 

Tables 3.3 and 3.4 list the PER-ELP, the PER-WLM, and the WER for four
languages out of the GlobalPhone Database for both monophone and triphone AMs.
The four languages and the amount of training data used are Croatian (12 hours),
German (14 hours), Japanese (24 hours), and Turkish (14 hours). (Japanese
transcripts are in a Romanized format.) Error rates improve when triphone models are
used instead of monophone models, which shows that there is enough training data
in the languages in GlobalPhone to build proper triphone AMs.

The error rates are consistent with previous research conducted on the Global¬ 
Phone database [38] as shown in Table 3.5 (Croatian is not evaluated in [38]). Error 
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Croatian 

German 

Japanese 

Turkish 

PER-ELP 

36.4 

53.8 

35.8 

45.1 

PER-WLM 

19.5 

17.1 

16.8 

3.7 

WER 

47.3 

34.4 

47.7 

10.9 


Table 3.3: Error rates on four languages in the GlobalPhone database using SONIC 
to bootstrap from English AMs to monophone AMs for that language. All results are 
after three iterations of alignment (see Figure 2.7). 



Metric (%)   Croatian   German   Japanese   Turkish
PER-ELP        33.1      50.0      30.0      40.1
PER-WLM        10.8      11.5       8.8       3.2
WER            32.3      22.3      36.4      10.5


Table 3.4: Error rates on four languages in the GlobalPhone database using SONIC 
to bootstrap from English AMs to triphone AMs for that language. All results are 
after three iterations of alignment (see Figure 2.7). 


rates in Table 3.5 are for triphone AMs. Differences in error rates between those 
experiments and those discussed in this research can be attributed to the following: 
different recognition systems are used, different subsets of GlobalPhone (training and 
evaluation sets) are used, and different word language models are used. The German 
PER-ELP is higher than the rest of the languages because the number of German 
phonemes is 41, which is a much larger pool of phonemes to choose from compared 
to Croatian with 30 phonemes, Japanese with 31 phonemes, and Turkish with 29 
phonemes. The low Turkish PER-WLM (3.2%) is perhaps explained by 
the properties of the Turkish language [6], which result in long words 
that are easier to discriminate acoustically (assuming no OOV words). 



Metric (%)   German   Japanese   Turkish
PER-ELP       44.5      33.8      44.1
WER           11.8      10.0      16.9


Table 3.5: Error rates on three languages in the GlobalPhone database as stated in 
Schultz [38]. All results are based on triphone AMs. Differences between these results 
and the results reported for SONIC include: differences in ASR systems, differences 
in the partitions of the GlobalPhone database, and differences in WLMs. 
IV. Porting SONIC to Arabic 

4.1 Bootstrapping from English 

The first set of experiments to build an Arabic speech recognizer started with 
English AMs that were then bootstrapped into Arabic AMs with the process of Section 
2.6.6. The resulting Arabic AMs were used to decode the test data and compute 
PER-ELP, PER-WLM, and WER values. These results are named “Boot-ENG-0” 
and “Boot-ENG-3” for results after zero and after three iterations of bootstrapping, 
respectively. 

4.2 Bootstrapping from IPA-Based Multilingual Phonemes 

As indicated in Tables 3.1 and 3.2, there are multiple instances of the same 
English phoneme mapping to different Arabic phonemes. Some of the mappings are 
coarse approximations because there are phonemes in Arabic that do not occur in 
English (for example, Arabic has a “pharyngeal voiced fricative”). The premise behind 
using multiple languages to train the initial AMs is that such training would provide 
greater coverage of the Arabic phoneme space so that the initial Arabic alignments and 
the final Arabic AMs would be more accurate. As mentioned in Section 3.1.1, the 
languages chosen from the GlobalPhone database have a common multilingual phoneme 
set based on the IPA. Thus, in aligning the phoneme /f/, one could draw on /f/ AMs 
from Croatian, German, Japanese, and Turkish. On the other hand, English does not 
have a phoneme that closely corresponds to the Arabic /x/, but Croatian and German 
do, so their /x/ models might be useful in aligning the initial Arabic. There are still 
17 Arabic phonemes that are not represented by the four chosen training languages. 
These 17 phonemes are “mapped” to the next closest multilingual phoneme. “Closest” 
is determined by manner, then voicing, then place for consonants and according 
to the IPA vowel chart for vowels. Also, note that due to accent and dialect issues, 
just because all of the languages have an /i/ does not mean that they are all equally 
close to the Arabic /i/. 
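
The "manner, then voicing, then place" priority for consonants can be sketched as a lexicographic comparison. The feature values and the candidate inventory below are hypothetical illustrations, not the mappings actually used in Tables 3.1 and 3.2, and the exact-match scoring of place is a simplifying assumption.

```python
from typing import Dict, NamedTuple


class Consonant(NamedTuple):
    manner: str   # e.g. "fricative", "plosive"
    voiced: bool
    place: str    # e.g. "alveolar", "glottal"


def closest_consonant(target: Consonant, inventory: Dict[str, Consonant]) -> str:
    """Pick the label whose features best match, manner first, place last."""

    def priority(label: str):
        cand = inventory[label]
        # Tuples compare left to right, so a manner match outranks a
        # voicing match, which in turn outranks a place match.
        return (
            cand.manner == target.manner,
            cand.voiced == target.voiced,
            cand.place == target.place,
        )

    # sorted() makes tie-breaking deterministic (alphabetical by label).
    return max(sorted(inventory), key=priority)


# Hypothetical example: a voiced pharyngeal fricative with no exact match.
inventory = {
    "z": Consonant("fricative", True, "alveolar"),
    "h": Consonant("fricative", False, "glottal"),
    "g": Consonant("plosive", True, "velar"),
}
target = Consonant("fricative", True, "pharyngeal")
chosen = closest_consonant(target, inventory)
```

In this toy inventory the voiced fricative is preferred because it matches both manner and voicing; a real mapping would also need a graded notion of place and a separate rule for vowels.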
4.2.1 Building ML-IPA Acoustic Models. The process used to build the 
initial multilingual IPA-based (ML-IPA) AMs is as follows. First, all alignment information 
from each of the four training languages is gathered. This alignment information is 
from the third iteration of bootstrapping from English to that particular language. 
The alignment information contains phoneme labels (based on the IPA) and start and 
ending times that reference the audio file pertaining to the transcript. 

To build monophone ML-IPA AMs, all training data across the four training 
languages are grouped, and a three-state, 32-mixture HMM is trained and used for 
the initial alignment of the Arabic acoustic data. For the 17 Arabic phonemes that 
do not exist in Croatian, German, Japanese, and Turkish, the base phoneme ML-IPA 
AM, as determined by the mapping file, is used. 

The procedure used to build triphone ML-IPA AMs is more complicated. Recall 
that a triphone represents the base phoneme in the context of the preceding and 
following phoneme. When creating multilingual AMs for the purpose of Arabic speech 
recognition, one must be careful to only use phonemes that occur in Arabic. If a set 
of alignments that contains a non-Arabic phoneme is given to the HMM trainer, the 
resulting HMM model contains a context-dependent model accounting for that 
non-Arabic phoneme, and SONIC posts an error when attempting to decode with this 
model because that phoneme does not occur in the Arabic phoneme set. To guard 
against this error, the following process is developed. For a given base phoneme, all 
multilingual training alignment data are grouped together. This pool is then refined 
by discarding any frames involving triphones that contain contexts with non-Arabic 
phonemes. This process is repeated for each base phoneme, and the result is an ML-IPA 
triphone model that exists in the phoneme space of Arabic. 
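
The pooling and filtering step can be sketched as below. The tuple layout `(left, base, right, frames)` and the function name are assumptions made for illustration; the actual alignment format used with SONIC is not shown in this document.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Triphone = Tuple[str, str, str]


def pool_arabic_triphones(
    aligned: Iterable[Tuple[str, str, str, Sequence[int]]],
    arabic_phonemes: Set[str],
) -> Dict[Triphone, List[int]]:
    """Pool frames per triphone, dropping any non-Arabic context."""
    pooled: Dict[Triphone, List[int]] = {}
    for left, base, right, frames in aligned:
        # Keep the segment only if the base phoneme and both of its
        # context phonemes exist in the Arabic phoneme set.
        if {left, base, right} <= arabic_phonemes:
            pooled.setdefault((left, base, right), []).extend(frames)
    return pooled


# Toy data: "P" stands for a phoneme that does not occur in Arabic.
segments = [
    ("a", "b", "k", [1, 2]),
    ("a", "b", "P", [3]),
    ("a", "b", "k", [4]),
]
pooled = pool_arabic_triphones(segments, {"a", "b", "k"})
```

Only the two `a-b-k` segments survive, so the trainer never sees a context-dependent unit outside the Arabic phoneme set.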

4.3 Bootstrapping from Data-Driven Multilingual Phonemes 

One issue with using the IPA-based multilingual approach is that, for Arabic, 
there are 17 phonemes that must be mapped from “close” phonemes, where the idea of 
closeness is categorical in nature and somewhat subjective. A second issue is that just 
because several languages have the same basic phoneme according to the IPA, it does 
not follow that the phonemes all have exactly the same acoustic properties. There 
are generally differences between like phonemes across different languages. One way 
to address these issues is to implement a data-driven approach based on a measure 
of the distance between different phonemes from the different training languages and 
to allow the “close” matching phonemes to group together to form the new phoneme 
models for the initial Arabic AMs. 


4.3.1 Kullback-Leibler Distortion. The distance measure chosen is based on 
the Kullback-Leibler (KL) distortion measure [9,39], which is designed to measure the 
“difference” between Gaussians. Let $N_1 = \mathcal{N}(\mu_1, \Sigma_1)$ and 
$N_2 = \mathcal{N}(\mu_2, \Sigma_2)$ denote two Gaussian distributions for vectors of 
length $d$; then the KL distortion between them is 

$$
KL_{dist}(N_1, N_2) = \frac{1}{2}\left[ \log\frac{|\Sigma_2|}{|\Sigma_1|} - d
  + \mathrm{trace}\left(\Sigma_2^{-1}\Sigma_1\right)
  + (\mu_1 - \mu_2)^{T}\,\Sigma_2^{-1}\,(\mu_1 - \mu_2) \right]
$$


This expression is implemented by training three-state, single-mixture HMMs for 
each phoneme for each language. By using a single mixture, there is only one 39- 
dimensional Gaussian for each state. (There is no known closed-form expression 
for the KL distortion measure for GMMs, although there has been some work on 
approximations [15].) 

The KL distortion measure is not symmetric (i.e., $KL_{dist}(N_1,N_2) \neq KL_{dist}(N_2,N_1)$). 
To form a distance measure, one can average $KL_{dist}(N_1,N_2)$ and $KL_{dist}(N_2,N_1)$; we refer 
to this as the KL distance measure (KLDM). One other factor is handling the three 
states (beginning, middle, ending) for each phoneme in order to obtain a single 
distance measure between two phoneme models. Two approaches are investigated. The 
first approach equally weights the contributions of the three states, while the second 
approach weights the contributions of the states 0.25, 0.50, 0.25 under the assumption 
that the middle state, which is sometimes longer in duration, has fewer coarticulation 
effects and might be a better indicator of the distance between phoneme models. It 
is found that both weighting schemes return the same phoneme grouping trends, and 
therefore all data-driven phoneme clusters discussed in this research are derived by 
equally weighting the state distances. 
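
A minimal sketch of the KLDM computation follows. It uses the standard closed-form KL divergence between two Gaussians and assumes diagonal covariance matrices so that it needs only the standard library; whether the models in this work use diagonal covariances is an assumption, as are all function names.

```python
import math
from typing import Sequence, Tuple

# A single-mixture state: (means, variances), diagonal covariance assumed.
Gaussian = Tuple[Sequence[float], Sequence[float]]


def kl_dist(n1: Gaussian, n2: Gaussian) -> float:
    """KL distortion from n1 to n2 for diagonal-covariance Gaussians."""
    (m1, v1), (m2, v2) = n1, n2
    # Per dimension: log(v2/v1) - 1 + v1/v2 + (m1 - m2)^2 / v2.
    return 0.5 * sum(
        math.log(b / a) - 1.0 + a / b + (x - y) ** 2 / b
        for x, a, y, b in zip(m1, v1, m2, v2)
    )


def kldm(n1: Gaussian, n2: Gaussian) -> float:
    """Symmetric KL distance measure: the average of both directions."""
    return 0.5 * (kl_dist(n1, n2) + kl_dist(n2, n1))


def phoneme_distance(
    states_a: Sequence[Gaussian],
    states_b: Sequence[Gaussian],
    weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Weighted sum of per-state KLDMs for two three-state models."""
    return sum(w * kldm(a, b) for w, a, b in zip(weights, states_a, states_b))


g0: Gaussian = ([0.0], [1.0])
g1: Gaussian = ([1.0], [1.0])
wide: Gaussian = ([0.0], [4.0])
equal = phoneme_distance([g0, g0, g0], [g1, g0, g0])
middle = phoneme_distance([g0, g0, g0], [g1, g0, g0], weights=(0.25, 0.50, 0.25))
```

Passing `weights=(0.25, 0.50, 0.25)` gives the middle-weighted variant; the default corresponds to the equal weighting adopted for the clusters in this work.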

4.3.2 Data-Driven Phoneme Clusters. To derive data-driven Arabic phoneme 
clusters, each single-mixture (three-state) Arabic phoneme model is compared using 
the KLDM to each multilingual phoneme model from each of the four training 
languages. All models (including Arabic) are those bootstrapped from English after 
three iterations. Table 4.1 displays the resulting phoneme clusters for four threshold 
values using the KLDM. Trends in the phoneme groupings show Turkish phonemes 
dominating the closest distances to Arabic, with Croatian phonemes also matching 
often. With narrow threshold values (5 and 10), the data-driven phoneme clusters 
generally match the phonetic content of the Arabic phoneme; thus, the KLDM 
generally yields results that are intuitive. Relaxing the threshold value to 15, it is seen 
that “plosives” are grouped together in the example of /b/ and /d/ for AR-/dd/ and 
/k/, /p/, and /t/ for AR-/k/ and AR-/t/. Also note that the voicing matches in these 
two examples. Further, AR-/m/ includes both nasals /m/ and /n/. 

All experiments involving these data-driven multilingual phoneme clusters are 
referred to in the remainder of this thesis as ML-DD(threshold value) experiments, 
where the threshold value is that of the KLDM. 

4.3.3 Building the ML-DD Acoustic Models. The process used to build the 
ML-DD AMs is similar to that used to build the ML-IPA models, except instead of 
the IPA labels driving the frame combinations, the phoneme clusters seen in Table 4.1 
determine which training frames are grouped. 

To build a monophone ML-DD AM for a given phoneme and threshold, all 
training data corresponding to phoneme models found to be within the threshold (or 
the closest match when no phoneme models were within threshold) are grouped and 
used to train a three-state, 32-mixture HMM, which is used for the initial alignment 
of the Arabic acoustic data. 
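
The selection rule in the preceding paragraph (everything within threshold, or the single closest match otherwise) can be sketched as follows. The distance values are invented for illustration, and treating "within the threshold" as less than or equal is an assumption.

```python
from typing import Dict, List


def select_cluster(
    distances: Dict[str, float],
    threshold: float,
    fallback_to_closest: bool = True,
) -> List[str]:
    """Return source phoneme labels whose KLDM is within the threshold.

    If none qualifies and `fallback_to_closest` is set, return the single
    closest label instead, so every Arabic phoneme still gets training data.
    """
    within = sorted(label for label, d in distances.items() if d <= threshold)
    if within or not fallback_to_closest:
        return within
    return [min(distances, key=lambda label: distances[label])]


# Hypothetical KLDM values from one Arabic phoneme to three source phonemes.
example = {"TU-m": 4.2, "CR-m": 8.0, "JA-m": 13.5}
at_10 = select_cluster(example, 10)
at_3 = select_cluster(example, 3)
```

With these made-up values, a threshold of 10 selects two source phonemes, while a threshold of 3 selects none and falls back to the closest one.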
Arabic     Threshold
phoneme    5             10            15                    20
a          TU-ab*        TU-ab*        TU-ab,CR-a            JA-ab,TU-e,i2
al         TU-e*         TU-e*         TU-e*                 TU-e*
al         TU-i2*        TU-i2         GE-etu,r,TU-r         -
alal       TU-ab*        TU-ab*        TU-ab*                TU-ab*
aU         TU-o*         TU-o*         TU-o*                 TU-o*
b          CR-b,TU-b     -             JA-b                  CR-d,TU-d
C          TU-S*         TU-S          CR-sj,JA-S            TU-s,tS
Cl         JA-S*         JA-S*         JA-S*                 JA-S,TU-S
d          CR-d          TU-d          TU-b                  CR-b,g,JA-d,TU-g
D          JA-d*         JA-d*         JA-d*                 CR-d,JA-b,d,TU-d
dd         CR-d*         CR-d*         CR-d,JA-b,TU-b,d      CR-b,g,JA-d
f          CR-f,TU-f     -             GE-f                  JA-f
G          TU-Z*         TU-Z          TU-dZ                 CR-dp,zj
h          TU-h*         TU-h*         GE-h,TU-h*            -
H          JA-h*         JA-h*         JA-h*                 JA-h*
Hq         TU-ab*        TU-ab*        TU-ab*                TU-ab*
i          TU-i          CR-i          JA-i,TU-ue            GE-etu,i,JA-W,TU-i2
il         CR-i*         CR-i*         CR-i,TU-i             JA-i,il
j          TU-j*         TU-j          -                     -
k          TU-k*         CR-k,TU-k     CR-t,JA-k,TU-p,t      CR-p,GE-g,k
l          TU-l*         GE-l,TU-l     CR-L                  CR-n,r,JA-g,l,TU-n,r
ll         CR-L*         CR-L*         CR-L*                 CR-L*
m          TU-m          CR-m          CR-n,GE-m,JA-m,TU-n   JA-n,TU-l
ml         CR-m*,TU-m*   CR-m*,TU-m*   CR-m*,TU-m*           CR-m*,TU-m*
n          TU-n          CR-n          CR-m,GE-n,TU-m        CR-nj,JA-n,TU-l
nl         JA-n*         JA-n*         JA-n*                 JA-n
q          CR-k*         CR-k          TU-k                  JA-k,TU-p
Q          CR-x*         CR-x*         CR-x*                 CR-x*
r          TU-r*         CR-r,TU-r     -                     CR-l,GE-l,r,JA-l,TU-i2
rl         CR-r*         CR-r*         CR-r*                 CR-r
rr         TU-h*         TU-h*         TU-h                  CR-v,JA-g,TU-l,v
s          TU-s          CR-s          GE-s,z,JA-s           TU-z
S          TU-s*         TU-s          GE-s,z,JA-s           TU-z
sl         TU-s*         TU-s          -                     CR-s,JA-s
Sl         TU-s*         TU-s          -                     CR-s,JA-s
t          TU-t          CR-t          GE-t,TU-k,p           CR-k,p,GE-d,JA-k,t
T          TU-f*         TU-f          CR-f                  CR-x
td         TU-t*         CR-t,TU-t     CR-p,TU-p             CR-k,GE-t,JA-t,TU-k
u          TU-u          CR-u          CR-o,GE-u             JA-o,TU-i2,o
ul         GE-u*         GE-ol,u       CR-o,u,TU-o,u         GE-o,ul,JA-o
w          CR-l*,u*      CR-l*,u*      CR-l*,u*              CR-l*,u*
x          CR-x*         CR-x*         CR-x*                 CR-x
z          CR-z*,TU-z*   CR-z,TU-z     -                     GE-z,JA-z
Z          JA-g*,TU-l*   JA-g*,TU-l*   JA-g*,TU-l*           JA-g,TU-l

Table 4.1: Data-driven phoneme groupings based on the KL-distance metric for 
various thresholds. The first two letters signify the source language (CR-Croatian, 
GE-German, JA-Japanese, TU-Turkish), while the letters after the hyphen signify the 
phoneme label. As the threshold increases, only new phonemes are listed; all previous 
phoneme groupings still hold (a dash marks a threshold at which no new phonemes 
are added). The (*) signifies that no phoneme is below the given 
threshold, and the phoneme listed is the closest in distance to the Arabic phoneme. 
To illustrate this process for triphones, an example is given first for the base 
phoneme AR-/i/ with a KLDM threshold of 10 and then discussed in further detail. 

Given that the Turkish training data contains the triphone TU /f/-/i/-/m/, 
Table 4.1 shows that TU-/f/ maps to AR-/f/ and AR-/T/ at a KLDM of 10. Also, 
TU-/m/ maps to AR-/m/ and AR-/ml/ at a KLDM of 10. Therefore, the training 
frames affiliated with the Turkish sequence /f/-/i/-/m/ are added to the multilingual 
training data to create the following four Arabic contexts: /f/-/i/-/m/, /f/-/i/-/ml/, 
/T/-/i/-/m/, and /T/-/i/-/ml/. 

Thus, to build triphone ML-DD AMs, steps similar to those used to build the 
ML-IPA AMs are employed. For each Arabic base phoneme and for a given KLDM 
threshold, the set of clustered phoneme multilingual training data is input. Each 
triphone in the multilingual training data is then examined to see if it contains valid 
Arabic phonemes. If so, the triphone training frames are stored; otherwise the 
triphone data are discarded. If multiple mappings exist for a training phoneme, that 
triphone is repeated to cover all mappings. 
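
The repetition of a training triphone over all of its context mappings can be sketched as a Cartesian product. The reverse lookup table `source_to_arabic` is a hypothetical structure obtained by inverting Table 4.1 for one threshold; only the two entries from the worked example above are filled in.

```python
from itertools import product
from typing import Dict, List, Tuple

Triphone = Tuple[str, str, str]


def arabic_contexts(
    source_triphone: Triphone,
    arabic_base: str,
    source_to_arabic: Dict[str, List[str]],
) -> List[Triphone]:
    """Expand one source-language triphone into Arabic triphone labels.

    An empty list means a context has no Arabic mapping, in which case
    the triphone's frames would be discarded.
    """
    left, _, right = source_triphone
    lefts = source_to_arabic.get(left, [])
    rights = source_to_arabic.get(right, [])
    # One Arabic triphone per (left mapping, right mapping) combination.
    return [(l, arabic_base, r) for l, r in product(lefts, rights)]


# Entries taken from the worked example (KLDM threshold of 10).
source_to_arabic = {"TU-f": ["f", "T"], "TU-m": ["m", "ml"]}
contexts = arabic_contexts(("TU-f", "TU-i", "TU-m"), "i", source_to_arabic)
```

For the Turkish triphone in the example this yields the same four Arabic contexts listed in the text, each of which would receive a copy of the training frames.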

4.4 Bootstrapping Results 

Figure 4.1 shows error rates for the three different metrics used to evaluate 
the different modeling methods. In all experiments the pronunciation dictionary and 
language model are held constant; only the AM varied. These error rates are computed 
using the program Sclite [10] and represent the average of the errors for the five Arabic 
speakers (589 utterances) in the evaluation set. 
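
For readers unfamiliar with how such error rates are defined, the sketch below computes a token error rate as (substitutions + deletions + insertions) divided by the number of reference tokens, using a plain Levenshtein alignment. It is not Sclite's implementation, which offers more alignment and reporting options; the function name is an assumption. The same function gives a WER on word lists and a PER on phoneme lists.

```python
from typing import Sequence


def error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Percent token error rate based on Levenshtein distance."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_tok == hyp_tok else 1)
            deletion = prev[j] + 1       # reference token missing from hypothesis
            insertion = curr[j - 1] + 1  # extra token in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(reference)


rate = error_rate("a b c d".split(), "a x c".split())
```

Because insertions count as errors, the rate can exceed 100% when the hypothesis is much longer than the reference.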

The first four bar groupings in Figures 4.1 and 4.2 show the results of decoding 
the Arabic speech with AMs that do not contain any Arabic data. From the PER-ELPs, 
it is seen that the monophone performance does not vary greatly between AMs, 
but for the triphones the ML-IPA AM performs 12% better (in an absolute sense) 
than the best ML-DD AM. For PER-WLM, the ML-DD10 monophone AM performs 
best, with the other ML-DD monophone AMs only slightly worse. 
The ML-DD10 triphone AM performs the worst of the triphone models for this error 
metric. For the WER, the ML-DD10 monophone models perform the best (as with 
the PER-WLM); however, for triphones, the ML-DD20 models perform the best (as for 
the PER-WLM). Because the monophone ML-DD10 AM returns the best WER out 
of the three ML-DD experiments, only this KLDM threshold is investigated further 
with the bootstrapping and adaptation iterations. 

The next two groupings of bars in Figures 4.1 and 4.2 show the results of the 
different error metrics after the initial bootstrap of Arabic data (0 iter) and after 
three iterations of bootstrapping (3 iter). All three metrics show that the ML-DD10 
AM performs best on the initial alignment of the Arabic for both monophones and 
triphones. After three iterations of bootstrapping, the ML-DD10 AM is still best for 
monophones but not for triphones, although for triphones it is only 1.3% worse for the 
PER-ELP, 0.2% worse for the PER-WLM, and 0.8% worse for the WER 
(all percentages in an absolute sense). Overall, after three iterations of bootstrapping, 
all three modeling procedures tend to converge to similar performance levels. 
Tables A.1-A.12 show the phoneme confusion matrices for each of these experiments 
and are discussed in further detail in the Appendix. 

There are notable challenges with the version of the transcripts and dictionary 
used in these experiments which may account for high WER and PER values. The first 
is the added complexity in the dictionary of the prefix AL (Arabic for “the”) attached 
to many words. Entries exist for both the root word and the root word with the prefix 
attached. Also, sometimes AL is written as AeL. By adding these entries, there is a 
larger search space and a greater chance of error. The second issue with the transcripts 
is their inconsistency. With the aforementioned discussion about dictionary entries 
with and without prefixes, the transcripts are found to sometimes have the prefix 
attached and sometimes not. Also, in the transcripts certain phrases (perhaps Arabic 
colloquialisms) have multiple words grouped together by “underbars”. The phrases 
are left “as is”, and dictionary entries are created to account for these “words”. 
[Figure 4.1 bar groups, left to right: ML-IPA, ML-DD10, ML-DD15, ML-DD20, 
Boot-Eng-0, Boot-IPA-0, Boot-DD10-0, Boot-Eng-3, Boot-IPA-3, Boot-DD10-3] 


Figure 4.1: Monophone AM results on Arabic test data for Bootstrapping 
Experiments. The first grouping shows results of using AMs with no Arabic acoustic data, 
the second grouping is after an initial alignment and training stage of Arabic data, 
and the third grouping is after three iterations of bootstrapping. 



[Figure 4.2 bar groups, left to right: ML-IPA, ML-DD10, ML-DD15, ML-DD20, 
Boot-Eng-0, Boot-IPA-0, Boot-DD10-0, Boot-Eng-3, Boot-IPA-3, Boot-DD10-3] 


Figure 4.2: Triphone AM results on Arabic test data for Bootstrapping 
Experiments. The first grouping shows results of using AMs with no Arabic acoustic data, 
the second grouping is after an initial alignment and training stage of Arabic data, 
and the third grouping is after three iterations of bootstrapping. 
These phrases do not occur repeatedly in the training data, and therefore affect the 
LM statistics. 

4.5 Adapting Multilingual AMs to Arabic 

The motivation behind trying adaptation as opposed to bootstrapping is the fact 
that bootstrapping discards all previous AMs and only uses the most recent model 
trained on whatever acoustic data is available for the given target language. In the 
case of small resource languages, if the initial AM has a reasonable amount of coverage 
of the target language phoneme set, adaptation might tune the acoustic model while 
still allowing generalization of the initial model trained on larger amounts of data. 

Using the SMAPLR adaptation scheme as discussed in Section 2.6.5, an 
alternative porting method to bootstrapping is evaluated. An initial multilingual AM is 
adapted with training data from the Arabic training set. Initial AMs are either 
ML-IPA or ML-DD10 AMs. English AMs are not used because of difficulties getting 
access to the exact alignments that went into the English models. The SMAPLR 
approach automatically creates regression class transformations and adjusts the mean 
and variance values of the HMM components. The phoneme duration parameters 
(represented by a gamma distribution) are not modified. 

As indicated in Figures 4.3 and 4.4, multiple iterations of adaptation improve 
ASR performance; however, after three iterations the adapted AMs do not outperform 
the bootstrapped AMs. Since adaptation takes substantially more computational 
power and time than bootstrapping, no further model adaptation experiments are 
conducted. 

4.6 Supplementing with IPA and Data-Driven Multilingual Data 

A last set of experiments investigated supplementing the Arabic data with 
multilingual data. The motivation is that if some multilingual data matches the Arabic 
data closely enough, it might be utilized to increase the amount of training data for the 
[Figure 4.3 (monophone, PER-ELP) bar groups, left to right: Boot-Eng-0, Adapt-IPA-0, 
Adapt-DD10-0, Boot-Eng-3, Adapt-IPA-3, Adapt-DD10-3] 


Figure 4.3: Monophone AM adaptation PER-ELP results on the Arabic test data. 
Recognition performance improves after three iterations of adaptation but does not 
outperform the baseline bootstrapped AMs. 



[Figure 4.4 bar groups, left to right: Boot-Eng-0, Adapt-IPA-0, Adapt-DD10-0, 
Boot-Eng-3, Adapt-IPA-3, Adapt-DD10-3] 


Figure 4.4: Triphone AM adaptation PER-ELP results on the Arabic test data. 
Recognition performance improves after three iterations of adaptation but does not 
outperform the baseline bootstrapped AMs. 
Arabic models. The first step of the supplementation process is to bootstrap 
Arabic, Croatian, German, Japanese, and Turkish AMs using the standard bootstrapping 
process starting with English AMs. After three iterations of bootstrapping, these final 
phonetic alignments are saved. The next step depends on whether the “closeness” of 
the multilingual data is determined by IPA or by the data-driven approach. 

The first supplementation experiment uses the IPA labels to group the 
multilingual data with the Arabic data. For each base phoneme, multilingual acoustic data 
that match the Arabic phoneme labels are added to the Arabic data, taking 
care not to include non-Arabic phonemes in any triphone context as 
discussed in Section 4.2.1. 

The next group of supplementation experiments uses the data-driven clusters 
indicated in Table 4.1. The first set of experiments supplements the Arabic data 
with the multilingual phonemes listed in the column corresponding to a given KLDM 
threshold, so that the Arabic data are supplemented with multilingual data from 
phonemes either within threshold or with the closest matching multilingual phoneme. 
The second set of experiments (designated with a b after the threshold) requires 
the supplementation data to be at or below the given threshold. If no multilingual 
data matches the Arabic data closely enough, that Arabic phoneme model is not 
supplemented with any extra data. Again, the triphone models are built taking into 
consideration non-Arabic phoneme contexts as discussed in Section 4.3.3. 

Figures 4.5 and 4.6 display the PER-ELPs for all these supplementation 
experiments. None of the overall PER-ELPs outperform the baseline three-iteration 
bootstrapping from English AMs results. However, there are differences in the 
individual phoneme performances as seen in the confusion matrices shown in Tables A.13, 
A.14, A.15, and A.16. 

Figure 4.5: Monophone AM supplementation PER-ELP results on the Arabic test 
data. All AMs started as bootstrapped from English AMs, with multilingual training 
data supplementing the Arabic data at different levels of supplementation. 
Figure 4.6: Triphone AM supplementation PER-ELP results on the Arabic test 
data. All AMs started as bootstrapped from English AMs, with multilingual training 
data supplementing the Arabic data at different levels of supplementation. 
V. Conclusions 


5.1 Review 

In this work multilingual phoneme models are investigated for porting to a 
new language. Croatian, English, German, Japanese, and Turkish training data are 
used in various approaches to build an Arabic ASR system. The first approach uses 
standard bootstrapping methods from English AMs. Next, IPA-based labels are used 
to create multilingual AMs that are then bootstrapped to Arabic AMs. Finally, 
data-driven phoneme clusters based on the KLDM are used to build multilingual AMs that 
are then bootstrapped to Arabic AMs. The initial bootstrapped data-driven AMs 
return the lowest WER on the Arabic evaluation set at 73.4% for monophones and 
62.7% for triphones. After three iterations of bootstrapping, the data-driven phoneme 
clusters return the lowest PER-ELP for monophones at 51.3% and the IPA-based 
AMs return the lowest PER-ELP for triphones at 45.4%. 

SMAPLR adaptation is employed to adapt IPA-based multilingual AMs and 
data-driven multilingual AMs to Arabic data. While multiple iterations of the 
adaptation improve the AMs, the rate of improvement is not as rapid as for the 
bootstrapping methods. After three iterations of SMAPLR, no adapted AM outperforms any 
bootstrapped AM. 

Using IPA labels and the KLDM, multilingual training data are used to 
supplement Arabic training data and to build new Arabic AMs. The phoneme confusion 
matrices from these experiments show that certain phonemes increase in performance 
from the supplementation of the multilingual data, but the overall PER-ELP does 
not show any improvement over the standard method of bootstrapping. 

5.2 Future Work 

The results of the phoneme clusters from the KLDM suggest some interesting 
possibilities. If the KLDM is run on the Arabic AMs, closely matching Arabic phonemes 
may point to an area of high confusability and the need for more focused training 
data for the particular phonemes. This practice could be applied to other languages 
to evaluate the AM space of an ASR system and to point to problems with the AMs before 
testing on an evaluation set is complete. Another possibility is to compute KLDMs for 
triphones. Also, phoneme duration information could be added as one or more 
parameters of the distance measure. Extending the KLDM to 
multiple mixtures with an approximation is another extension to this work. 

These experiments could also be repeated using the HMM Toolkit (HTK) 
software package instead of SONIC to confirm results and to compare the 
forward-backward training methodology, as opposed to the align-train-realign Viterbi-based 
methodology of SONIC. 

Also, modifications to the Arabic dictionary and transcripts should be made 
to remove inconsistencies and the appended word colloquialisms. Changes to the 
dictionary would have drastic effects on automatic word alignment, which directly 
affects the way the final AMs are trained. By reducing the error rates in general, 
different trends using the multilingual modeling approaches may be found. 

Other experiments could constrain the amount of TL training data 
even further (1, 5, or 10 hours) and compare how the bootstrapping and adaptation 
methods perform. 

Combining the results from each AM experiment may result in overall lower 
error rates. It can be seen that each different AM approach produced different errors 
across the phoneme set, so combining system results could build upon the benefits of 
each of the approaches. Another way of combining AM approaches is a merger of the 
IPA and DD methods: use the IPA mapping for all “known” phonemes in the TL, 
and use the DD methods for all “unknown” phonemes in the TL. 

Finally, changing the number and set of languages used for training and testing 
could be used to verify the consistency of these results and to measure the robustness 
of these techniques. What if more languages were added to the language training set? 
What if the number was kept the same but languages were changed? How does the 
number of “unknown” phonemes in the TL affect the outcome of the AM approaches? 
Appendix A. Phoneme Confusion Matrices 

The following tables display phoneme confusion matrices (as described in Section 
3.3.4), for various experiments run on the Arabic evaluation data of the GlobalPhone 
database. All percentages represent results based on the PER-ELP metric. 

For tables that show recognition results in percent, high values along the 
diagonal and low values in all other cells are signs of a well-performing recognition system. 
For tables that compare recognition results between two systems with percent 
differences, positive values along the diagonal and negative values in off-diagonal cells show 
where the system improved compared to the baseline system. 
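
The two kinds of tables described above can be sketched as follows. Normalizing each row by its reference phoneme count is an assumption about how the percentages are formed (the definition is in Section 3.3.4, which is not reproduced here), and the nested-dictionary layout and counts are purely illustrative.

```python
from typing import Dict

Matrix = Dict[str, Dict[str, float]]


def to_percent(counts: Dict[str, Dict[str, int]]) -> Matrix:
    """Convert confusion counts to row percentages per reference phoneme."""
    percent: Matrix = {}
    for ref, row in counts.items():
        total = sum(row.values())
        percent[ref] = (
            {hyp: 100.0 * n / total for hyp, n in row.items()} if total else {}
        )
    return percent


def difference(system: Matrix, baseline: Matrix) -> Matrix:
    """Cell-wise system minus baseline, in percentage points."""
    diff: Matrix = {}
    for ref in sorted(set(system) | set(baseline)):
        sys_row = system.get(ref, {})
        base_row = baseline.get(ref, {})
        diff[ref] = {
            hyp: sys_row.get(hyp, 0.0) - base_row.get(hyp, 0.0)
            for hyp in sorted(set(sys_row) | set(base_row))
        }
    return diff


# Toy counts: reference /a/ recognized as /a/ or confused with /i/.
system = to_percent({"a": {"a": 8, "i": 2}})
baseline = to_percent({"a": {"a": 6, "i": 4}})
delta = difference(system, baseline)
```

In the toy case the diagonal cell rises and the off-diagonal cell falls, which is the pattern the text describes as an improvement over the baseline.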

Table A.1 (page 50) shows the phoneme confusion matrix for the results of 
decoding the Arabic evaluation data with the initial monophone AM bootstrapped 
from English AMs with zero iterations. The phoneme labels are arranged to group 
broad phonetic classes together. Phonemes that are difficult to recognize (< 20% 
correct) include /ml/, /Z/, /sl/, /Sl/, and /Cl/. Phonemes that are easier to recognize 
(> 70% correct) include /aU/, /f/, /C/, /x/, /G/, /b/, and /k/. 

Table A.2 (page 51) compares the PER-ELP-based performance differences 
using the monophone ML-IPA bootstrapped AM relative to using the monophone 
English bootstrapped AM. All values are in percent, and positive numbers along the 
diagonal show where the value increases using the ML-IPA AM versus the English 
bootstrapped AM, and negative numbers along the diagonal show where the value 
decreases using the ML-IPA AM versus the English bootstrapped AM. The phonemes 
/al/, /il/, /u/, /l/, /n/, /f/, and /dd/ increase the most (by over 5% each), while 
/i/, /r/, and /sl/ decrease the most (by over 5% each). 

Table A.3 (page 52) compares performance differences using the monophone 
ML-DD10 bootstrapped AM relative to using the monophone English bootstrapped 
AM. This table shows even greater differences than for the ML-IPA AM. The phonemes /a/, 
/u/, /l/, /r/, /n/, /Z/, /Q/, and /Hq/ show the greatest improvement, while the 
phonemes /al/, /rl/, /nl/, and /sl/ show the greatest decline in performance. In 
general, all long phonemes (designated with an l) show some degree of degradation 
using the ML-DD10 bootstrapped AM, but their corresponding short versions have 
performance gains. 

Tables A.4, A.5, and A.6 show the same order of experiments as just discussed, 
but after three iterations of bootstrapping the AMs. After three iterations the AMs 
bootstrapped from English have the same difficulty in recognizing the phonemes 
/ml/, /Z/, /sl/, /Sl/, and /Cl/ (see Table A.4, page 53). The phonemes that are 
easier to recognize are the same as for the initial bootstrapped AMs: /aU/, /f/, /C/, 
/x/, /G/, /b/, and /k/, with the addition of /m/ and /Hq/. 

Table A.5 (page 54) compares the performance difference between three 
iterations of ML-IPA bootstrapped monophones and three iterations of English 
bootstrapped monophones. The overall difference in PER-ELP is a 0.5% improvement for 
the ML-IPA AM, and no phonemes improve by at least 5%, but /i/ and /rl/ decrease 
by at least 5%. 

Table A.6 (page 55) compares the performance difference between three 
iterations of ML-DD10 bootstrapped monophones and three iterations of English 
bootstrapped monophones. The PER-ELP improves by 1.3% absolute, and the phoneme 
confusion matrix shows that /alal/ and /I/ improve and /al/ and /rl/ degrade the 
most. 

Table A.7 (page 56) shows the phoneme confusion matrix for decoding the 
Arabic evaluation data with the initial triphone AMs bootstrapped from English AMs. 
The phonemes that are difficult to recognize include /ml/, /Z/, /sl/, /Sl/, and /Cl/. 
Phonemes that are easier to recognize include /aU/, /f/, /C/, /x/, /G/, /b/, /k/, 
/a/, and /nr/. 

Table A.8 (page 57) compares the performance differences of using the triphone 
ML-IPA bootstrapped AMs and the triphone English bootstrapped AMs. The 
PER-ELP for the initial ML-IPA AMs is 1.2% worse than the English bootstrapped AMs, 
but the following phonemes improve performance by at least 5%: /alal/, /al/, /il/, 
/ul/, /l/, /rr/, and /dd/. Five phonemes are degraded in performance by at least 
5%: /i/, /u/, /x/, /td/, and /Hq/. 

Table A.9 (page 58) compares the triphone English bootstrapped AMs with the
triphone ML-DD10 bootstrapped AMs. The PER-ELP for the initial ML-DD10 AMs is
2.3% better than for the English bootstrapped AMs. Five phonemes, /u/, /l/,
/r/, /rr/, and /Hq/, improve by at least 5%, and one phoneme, /si/, degrades by
more than 5%.

Table A.10 (page 59) shows the phoneme confusion matrix for decoding the Arabic
evaluation data after three iterations of bootstrapping starting from the
triphone English AMs. After three bootstrapping iterations, the phonemes that
are difficult to recognize remain the same; however, /r/ and /Hq/ are added to
the list of phonemes that are easier to recognize, while /aU/ drops to 67.9%
correct.

Table A.11 (page 60) compares the triphone ML-IPA bootstrapped AMs after three
iterations with the triphone English bootstrapped AMs after three iterations.
The PER-ELP is identical for the two experiments at 45.4%, but /ul/ is
recognized 7.7% better with the ML-IPA AM, and /ll/, /al/, and /u/ are
recognized 7.6%, 5.5%, and 5.2% worse with the ML-IPA AM, respectively. All
other phonemes change by less than ±5%.

Table A.12 (page 61) compares the triphone ML-DD10 bootstrapped AMs after three
iterations with the triphone English bootstrapped AMs after three iterations.
The PER-ELP with the ML-DD10 AM increases to 46.7% (from 45.4%), due in part to
/a/, /ll/, and /dd/ being recognized 11.7%, 6.6%, and 8.9% worse, respectively,
than with the English bootstrapped AM. However, the phoneme /ul/ is recognized
8.9% better than with the English bootstrapped AM.

Table A.13 (page 62) compares three iterations of English bootstrapped
monophones with the same AM supplemented with monophone data based on IPA
labels. The PER-ELP of the ML-IPA supplemented AM is 3.4% worse overall, but
the phoneme confusion matrix shows that individual phoneme performance varies
drastically. Six phonemes improve in correct recognition by 5.0% to 17.7%,
while 13 phonemes degrade by 6.5% to 43.2%.

Table A.14 (page 63) shows the results of reducing the amount of
supplementation data. With the threshold set to “5”, only the following Arabic
phonemes are supplemented: {/i/, /u/, /m/, /n/, /f/, /s/, /b/, /t/, /d/}, and
these phonemes are supplemented only by multilingual data that match with a
KLDM of “five” or less. All phonemes in this set increase in recognition
performance except /m/ (-1.0%) and /s/ (-0.9%). Note that /nl/ is recognized
9.1% better with these AMs.
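
The threshold selection described above can be sketched as follows. This is only an illustration under strong assumptions: each phoneme model is reduced to a single diagonal-covariance Gaussian, and the distance is a symmetric Kullback-Leibler divergence. The KLDM actually used in this work may be computed and scaled differently, and all names and numbers below are hypothetical.

```python
import math


def kl_diag_gauss(mean_p, var_p, mean_q, var_q):
    """KL(p || q) for two diagonal-covariance Gaussians."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def symmetric_kl(p, q):
    """Symmetric distance: KL(p || q) + KL(q || p); p and q are (mean, var)."""
    return kl_diag_gauss(*p, *q) + kl_diag_gauss(*q, *p)


def select_supplements(target, candidates, threshold=5.0):
    """Return the candidate names whose distance to the target is <= threshold."""
    chosen = []
    for name, model in candidates.items():
        if symmetric_kl(target, model) <= threshold:
            chosen.append(name)
    return sorted(chosen)


# Toy two-dimensional models (mean vector, variance vector); illustrative only.
arabic_s = ([0.0, 1.0], [1.0, 1.0])
candidates = {
    "German/s": ([0.5, 1.0], [1.0, 1.0]),   # distance 0.25, kept
    "Turkish/s": ([4.0, 1.0], [1.0, 1.0]),  # distance 16.0, rejected
}
print(select_supplements(arabic_s, candidates))  # ['German/s']
```

Lowering the threshold shrinks the pool of supplementation data, which is the trade-off explored in Tables A.14 and A.16.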

Table A.15 (page 64) compares three iterations of English bootstrapped
triphones with the same AM supplemented with triphone data based on IPA labels.
The PER-ELP is degraded by 11.4%, and 21 phonemes decrease in recognition
performance by 6.4% to 38.5%. The phonemes /ml/, /nl/, /Z/, /si/, and /SI/
increase in recognition performance by 7.1% to 10.5%.

Table A.16 (page 65) shows the results of supplementing the English
bootstrapped triphone AM with multilingual data that match within a KLDM of
“five” or less. The phoneme set is the same as before, {/i/, /u/, /m/, /n/,
/f/, /s/, /b/, /t/, /d/}, and each of these phonemes increases in recognition
performance except /m/ (-0.2%) and /d/ (-1.9%).

Table A.1: Phoneme Confusion Matrix for monophone AM bootstrapped from English AM with zero iterations. All values
are in percent with the diagonal showing percent correct for each phoneme. PER-ELP for this experiment is 57.2%.

zero iterations to the monophone AM bootstrapped from ML-IPA. All values are in percent with positive values along the
diagonal showing an increase in recognition performance and negative values showing a decrease in recognition performance.
PER-ELP for this experiment is 56.2%.













zero iterations to the monophone AM bootstrapped from ML-DD10. All values are in percent with positive values along the
diagonal showing an increase in recognition performance and negative values showing a decrease in recognition performance.
PER-ELP for this experiment is 51.7%.

Table A.4: Phoneme Confusion Matrix for monophone AM bootstrapped from English AM with three iterations. All
values are in percent with the diagonal showing percent correct for each phoneme. PER-ELP for this experiment is 52.6%.

three iterations to the monophone AM bootstrapped from ML-IPA. All values are in percent with positive values along the
diagonal showing an increase in recognition performance and negative values showing a decrease in recognition performance.
PER-ELP for this experiment is 52.1%.

three iterations to the monophone AM bootstrapped from ML-DD10. All values are in percent with positive values along the
diagonal showing an increase in recognition performance and negative values showing a decrease in recognition performance.
PER-ELP for this experiment is 51.3%.

Table A.7: Phoneme Confusion Matrix for triphone AM bootstrapped from English AM with zero iterations. All values
are in percent with the diagonal showing percent correct for each phoneme. PER-ELP for this experiment is 48.5%.

zero iterations to the triphone AM bootstrapped from ML-IPA. All values are in percent with positive values along the
diagonal showing an increase in recognition performance and negative values showing a decrease in recognition performance.
PER-ELP for this experiment is 49.7%.

zero iterations to the triphone AM bootstrapped from ML-DD10. All values are in percent with positive values along the
diagonal showing an increase in recognition performance and negative values showing a decrease in recognition performance.
PER-ELP for this experiment is 46.2%.

Table A.10: Phoneme Confusion Matrix for triphone AM bootstrapped from English AM with three iterations. All values
are in percent with the diagonal showing percent correct for each phoneme. PER-ELP for this experiment is 45.4%.

three iterations to the triphone AM bootstrapped from ML-IPA. All values are in percent with positive values along the
diagonal showing an increase in recognition performance and negative values showing a decrease in recognition performance.
PER-ELP for this experiment is 45.4%.

three iterations to the triphone AM bootstrapped from ML-DD10. All values are in percent with positive values along the
diagonal showing an increase in recognition performance and negative values showing a decrease in recognition performance.
PER-ELP for this experiment is 46.7%.

with three iterations to this AM supplemented with monophone data from IPA labels. All values are in percent with
positive values along the diagonal showing an increase in recognition performance and negative values showing a decrease
in recognition performance. PER-ELP for this experiment is 56.0%.

of “5” to the Arabic phonemes. All values are in percent with positive values along the diagonal showing an increase in
recognition performance and negative values showing a decrease in recognition performance. PER-ELP for this experiment
is 52.7%.

three iterations to this AM supplemented with triphone data from IPA labels. All values are in percent with positive values
along the diagonal showing an increase in recognition performance and negative values showing a decrease in recognition
performance. PER-ELP for this experiment is 56.8%.

three iterations to this AM supplemented with triphone data from multilingual data matching within a distance of “5” to
the Arabic phonemes. All values are in percent with positive values along the diagonal showing an increase in recognition
performance and negative values showing a decrease in recognition performance. PER-ELP for this experiment is 45.7%.










Appendix B. Additional Information 


Phoneme  Example    Phoneme  Example
AA       father     DD       had
AE       mad        KD       talk
AH       but        JH       Jerry
AO       for        K        kitten
AW       frown      L        listen
AX       alone      M        manager
AXR      butter     N        nancy
AY       hire       NG       fishing
B        bob        OW       cone
CH       church     OY       boy
D        don’t      P        pop
PD       top        R        red
TD       lot        S        sonic
DX       butter     TS       bits
DH       them       GD       mug
EH       bed        SH       show
ER       bird       T        tot
EY       state      TH       thread
F        friend     UH       hood
G        grown      UW       moon
HH       had        V        very
IH       bitter     W        weather
IX       roses      Y        yellow
IY       beat       Z        bees
BD       tab        ZH       measure

Table B.1: American English phoneme set used by SONIC [28].


Language    Words in Dictionary
Arabic      47,688
Croatian    24,186
German      40,706
Japanese    32,543
Turkish     31,944

Table B.2: Number of words in each language’s dictionary.


Phoneme  Croatian       German         Japanese       Turkish        Arabic
a        (9749) 70.0    (4009) 78.7    -              -              (21845) 47.0
al       -              (1645) 112.9   -              -              (10010) 38.3
alal     -              -              -              -              (60) 106.8
ab       -              -              (12548) 68.6   (8860) 74.7    -
abl      -              -              (185) 125.4    -              -
ae       -              (271) 106.0    -              -              -
aI       -              (1275) 113.2   -              -              (420) 86.2
atu      -              (1444) 60.0    -              -              -
aU       -              (482) 135.5    -              -              (444) 97.5
b        (1283) 78.4    (1776) 68.4    (953) 68.9     (1858) 74.0    (2376) 71.9
cp       (633) 121.0    -              -              -              -
C        -              (1101) 96.4    -              -              (531) 111.1
Cl       -              -              -              -              (143) 104.1
d        (2965) 63.0    (4203) 50.8    (2088) 57.2    (3316) 64.6    (2168) 77.6
dd       -              -              -              -              (502) 64.9
dp       (191) 91.6     -              -              -              -
dZ       (11) 57.3      -              (1318) 90.3    (970) 92.8     -
D        -              -              -              -              (417) 63.4
e        (7670) 56.2    (2464) 64.1    (4864) 70.5    (8596) 72.2    -
el       -              (2405) 69.3    (1694) 124.8   -              -
etu      -              (7252) 45.2    -              -              -
ell      -              (335) 115.7    -              -              -
f        (189) 100.0    (2545) 107.5   (319) 76.4     (346) 104.7    (1662) 90.8
g        (1451) 63.8    (1975) 65.5    (2218) 55.3    (1013) 84.5    -
G        -              -              -              -              (994) 90.4
h        -              (819) 69.4     (1808) 69.8    (762) 64.5     (1417) 67.0
H        -              -              -              -              (1103) 108.8
Hq       -              -              -              -              (2274) 66.9
i        (8047) 58.2    (4340) 51.4    (9975) 54.5    (7312) 54.6    (9076) 44.5
il       -              (2408) 73.2    (160) 129.3    -              (3774) 77.7
i2       -              -              -              (3939) 45.0    -
j        (3293) 50.3    (347) 66.5     (2149) 58.7    (2538) 69.2    (2069) 62.5
k        (3353) 85.1    (1826) 96.2    (7625) 76.4    (3559) 91.6    (1017) 98.6
l        (2190) 55.1    (3118) 56.1    (3501) 48.5    (5548) 49.7    (7955) 44.4
ll       -              -              -              -              (399) 66.0
L        (421) 57.8     -              -              -              -
m        (2450) 80.1    (2252) 80.7    (2311) 81.1    (2719) 72.5    (4262) 64.6
ml       -              -              -              -              (180) 76.5

Table B.3: Count and average duration of each multilingual phoneme in the
GlobalPhone test subset. The number of times the phoneme occurs is in “()”,
followed by the average duration (in ms) as determined by the Viterbi-based
alignment of the reference transcripts by using the third iteration of English
bootstrapped AMs for that language.


Phoneme  Croatian       German         Japanese       Turkish        Arabic
n        (4906) 60.0    (9179) 66.5    (5179) 59.8    (6273) 62.4    (3805) 65.8
nl       -              -              -              -              (461) 72.9
ng       -              (833) 88.4     -              -              -
nj       (617) 82.2     -              -              -              -
nq       -              -              (3981) 81.3    -              -
o        (7985) 61.5    (1396) 72.3    (7732) 69.8    (1826) 96.2    -
ol       -              (1187) 92.1    (3043) 126.5   -              -
oe       -              (126) 77.1     -              (854) 89.7     -
ocl      -              (125) 91.1     -              -              -
p        (2752) 94.3    (1041) 94.7    (264) 73.8     (745) 102.3    -
q        -              -              -              -              (1489) 105.6
Q        -              -              (941) 93.6     -              (4002) 62.1
r        (5011) 46.8    (6131) 48.8    -              (5430) 47.1    (3043) 51.1
rl       -              -              -              -              (237) 62.1
rr       -              -              -              -              (215) 62.6
s        (4334) 101.0   (2990) 107.9   (3057) 96.9    (2752) 119.8   (1990) 103.6
si       -              -              -              -              (294) 101.9
sj       (565) 114.4    -              -              -              -
S        -              (1521) 109.2   (2613) 104.2   (1112) 117.5   (489) 99.1
SI       -              -              -              -              (129) 100.5
sft      -              -              -              (740) 53.7     -
t        (3898) 71.7    (5991) 70.9    (4454) 69.2    (3228) 88.7    (4557) 78.2
td       -              -              -              -              (848) 89.5
ts       (1334) 103.7   (1762) 119.6   (987) 94.5     -              -
tS       (811) 109.2    -              (849) 104.4    (679) 108.8    -
T        -              -              -              -              (511) 89.0
u        (3865) 72.5    (2157) 65.7    -              (2496) 55.2    (3201) 52.3
ul       -              (752) 76.6     -              -              (1128) 81.1
ue       -              (348) 53.9     -              (1411) 53.0    -
uel      -              (375) 76.5     -              -              -
v        (3218) 50.8    (1628) 66.0    -              (838) 62.9     -
w        -              -              (878) 83.1     -              (2392) 59.0
W        -              -              (6301) 46.6    -              -
W1       -              -              (1099) 101.1   -              -
x        (969) 76.8     (695) 93.8     -              -              (543) 107.7
z        (1624) 89.9    (1530) 92.8    (635) 74.6     (1052) 91.0    (481) 94.8
zj       (390) 94.1     -              -              -              -
Z        -              -              -              (61) 119.8     (187) 55.2

Table B.4: Continuation of the count and average duration of each multilingual
phoneme in the GlobalPhone test subset. The number of times the phoneme occurs
is in “()”, followed by the average duration (in ms) as determined by the
Viterbi-based alignment of the reference transcripts by using the third
iteration of English bootstrapped AMs for that language.
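
The per-phoneme counts and average durations in Tables B.3 and B.4 can be derived from aligned segments along the following lines. This is a minimal sketch, assuming the alignment is available as (phoneme, start, end) tuples in milliseconds. The function name `duration_stats` and the toy segments are hypothetical and do not reflect the actual SONIC alignment output format.

```python
from collections import defaultdict


def duration_stats(segments):
    """Return {phoneme: (count, mean duration in ms)} from aligned segments.

    segments: iterable of (phoneme, start_ms, end_ms) tuples.
    """
    totals = defaultdict(lambda: [0, 0.0])  # phoneme -> [count, summed duration]
    for phoneme, start_ms, end_ms in segments:
        entry = totals[phoneme]
        entry[0] += 1
        entry[1] += end_ms - start_ms
    return {p: (count, round(total / count, 1)) for p, (count, total) in totals.items()}


# Toy alignment, invented for illustration only.
segments = [("a", 0, 50), ("b", 50, 120), ("a", 120, 160)]
stats = duration_stats(segments)
print(stats)  # {'a': (2, 45.0), 'b': (1, 70.0)}
```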